TILE PROCESSOR AND I/O DEVICE GUIDE FOR THE TILE-GX FAMILY OF PROCESSORS
RELEASE 1.12
DOC. NO. UG404
OCTOBER 2014
TILERA CORPORATION

Copyright © 2010-2014 Tilera Corporation. All rights reserved. Printed in the United States of America.

No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as may be expressly permitted by the applicable copyright statutes or in writing by the Publisher.

The following are registered trademarks of Tilera Corporation: Tilera and the Tilera logo.

The following are trademarks of Tilera Corporation: Embedding Multicore, The Multicore Company, Tile Processor, TILE Architecture, TILE64, TILEPro, TILEPro36, TILEPro64, TILExpress, TILExpress-64, TILExpressPro-64, TILExpress-20G, iMesh, TileDirect, TILExtreme-Gx, TILExtreme-Gx Duo, TILEmpower, TILEmpower-Gx, TILEmpower-Gx36, TILEmpower-Gx72, TILEncore, TILEncorePro, TILEncore-Gx, TILEncore-Gx9, TILEncore-Gx16, TILEncore-Gx36, TILEncore-Gx72, TILE-Gx, TILE-Gx9, TILE-Gx16, TILE-Gx36, TILE-Gx72, TILE-Gx8072, TILE-Gx3000, TILE-Gx5000, TILE-Gx8000, TILE-Gx8009, TILE-Gx8016, TILE-Gx8036, TILE-Gx3036, DDC (Dynamic Distributed Cache), Multicore Development Environment, Gentle Slope Programming, TMC (Tilera Multicore Components), hardwall, Zero Overhead Linux (ZOL), MiCA (Multicore iMesh Coprocessing Accelerator), and mPIPE (multicore Programmable Intelligent Packet Engine). All other trademarks and/or registered trademarks are the property of their respective owners.

Third-party software: The Tilera IDE makes use of the BeanShell scripting library. Source code for the BeanShell library can be found at the BeanShell website (http://www.beanshell.org/developer.html).

This document contains advance information on Tilera products that are in development, sampling or initial production phases.
This information and specifications contained herein are subject to change without notice at the discretion of Tilera Corporation. No license, express or implied by estoppels or otherwise, to any intellectual property is granted by this document. Tilera disclaims any express or implied warranty relating to the sale and/or use of Tilera products, including liability or warranties relating to fitness for a particular purpose, merchantability or infringement of any patent, copyright or other intellectual property right. Products described in this document are NOT intended for use in medical, life support, or other hazardous uses where malfunction could result in death or bodily injury.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN “AS IS” BASIS. Tilera assumes no liability for damages arising directly or indirectly from any use of the information contained in this document.

Publishing Information:
Document number: UG404
Release: 1.12
Date: 15 October 2014

Contact Information:
Tilera Corporation
Information: info@tilera.com
Web Site: http://www.tilera.com

Contents

PREFACE
About this Manual ... xxi
Intended Audience ... xxi
Manual Contents Description ... xxi
Related Documents ... xxiii
Technical or Customer Support ... xxiii
Product Information ... xxiii
Notation Conventions ... xxiii
Conventions for Register Descriptions ... xxiv
  Conventions for Processor Families ... xxiv
  Byte and Bit Order ... xxiv
  Reserved Fields ... xxv
  Numbering ... xxv

CHAPTER 1 I/O DEVICE INTRODUCTION
1.1 Overview ... 1
  1.1.1 Tile-to-Device Communication ... 1
  1.1.2 Coherent Shared Memory ... 2
  1.1.3 Device Protection ... 2
  1.1.4 Interrupts ... 2
  1.1.5 Device Discovery ... 3
  1.1.6 Common Registers ... 3

CHAPTER 2 TILE PROCESSOR
2.1 System Architecture Overview ... 7
2.2 Memory Architecture ... 8
2.3 Memory Addressing ... 9
  2.3.1 TLB Management ... 9
    2.3.1.1 TLB Miss Handling ... 24
2.4 Memory Consistency Model ... 26
  2.4.1 Overview ... 26
2.5 TILE-Gx Page Attribute Transitions and Cache Flushes ... 28
2.6 Protection ... 29
  2.6.1 Levels of Protection ... 29
  2.6.2 Protected Resources ... 29
2.7 Interrupt Model ... 29
  2.7.1 Introduction ... 29
    2.7.1.1 Interrupt/Exception State ... 30
    2.7.1.2 Nested Interrupts/Exceptions ... 31
    2.7.1.3 Interrupt Traits ... 31
    2.7.1.4 Interrupt Masks ... 32
    2.7.1.5 INTCTRL and Protection of Interrupt Masks ... 32
    2.7.1.6 VLIW and Interrupts ... 33
  2.7.2 Interrupt and Exception List ... 34
  2.7.3 Interrupt State, Control Registers, Double Faults, and IRET ... 35
    2.7.3.1 Interrupt State and Control Registers ... 35
    2.7.3.2 Double Faults ... 39
    2.7.3.3 IRET ... 40
  2.7.4 Interprocessor Interrupt (IPI) ... 40
  2.7.5 Distributed Interrupt Processing ... 40
  2.7.6 Proxying Interrupts ... 41
  2.7.7 Lower Protection Level Interrupts ... 41
  2.7.8 Downcalls ... 41
2.8 Software-Visible Dynamic Networks ... 44
  2.8.1 Overview ... 44
    2.8.1.1 Register Mapping and Interlock ... 44
    2.8.1.2 Routing ... 45
    2.8.1.3 Demultiplexing ... 46
    2.8.1.4 Receive-Side Buffering ... 47
  2.8.2 Ordering ... 47
    2.8.2.1 Packet Format ... 47
  2.8.3 Network Hardwall ... 48
  2.8.4 Interrupts ... 48
  2.8.5 Deadlocks ... 48
2.9 Special Purpose Registers (SPRs) ... 49
2.10 Performance Counters / System Diagnostics ... 49
  2.10.1 In-Tile System Devices ... 49
    2.10.1.1 Tile Timer and AUX_TILE_TIMER ... 49
    2.10.1.2 Cycle Counter ... 49
  2.10.2 Events ... 49
  2.10.3 Counters ... 50
  2.10.4 Watch Registers ... 50
  2.10.5 Pass SPR ... 50
  2.10.6 Broadcast Networks ... 50
  2.10.7 System Software Debug ... 51
    2.10.7.1 Tile Debug Port ... 51
    2.10.7.2 Quiesce ... 56
2.11 Boot Processes and Data Format ... 56
  2.11.1 Boot Flow ... 56
  2.11.2 Chip Modes and Reset Behavior ... 57
  2.11.3 Boot FIFO ... 58

CHAPTER 3 DOUBLE DATA RATE SDRAM (DDR3) INTERFACE
3.1 Overview ... 59
3.2 Interfaces ... 60
  3.2.1 DDR3 Interface ... 60
  3.2.2 Network Interface ... 60
3.3 Data Flows ... 60
  3.3.1 QDN Memory Read Request Flow ... 61
  3.3.2 RDN Memory Read Response Flow ... 61
  3.3.3 QDN Memory Write Request Flow ... 61
  3.3.4 RDN Memory Write Response Flow ... 61
  3.3.5 Non-Cacheline Write Flow and Masked Write Flow ... 61
3.4 Ordering ... 62
  3.4.1 Out of Order Dispatch ... 62
  3.4.2 Out of Order Response ... 62
3.5 Addressing ... 62
  3.5.1 Memory Controller Striping ... 63
  3.5.2 DDR Address Mapping (from Memory Address Mapping) ... 63
  3.5.3 Memory Rank/Bank Hashing ... 64
  3.5.4 Logical Rank and Physical Rank Mapping ... 64
3.6 Scheduler ... 64
  3.6.1 Memory Page Management Policy ... 64
  3.6.2 Memory Request Reordering ... 65
  3.6.3 Memory Command Reordering ... 65
3.7 DIMM Support ... 65
  3.7.1 Serial Presence-Detect EEPROM Support ... 66
  3.7.2 Temperature Sensor ... 66
  3.7.3 Address/Command Parity ... 66
  3.7.4 RDIMM Control Word Access ... 66
  3.7.5 Memory PHY Training ... 66

CHAPTER 4 PCIE CONTROLLER ARCHITECTURE (TRIO)
4.1 Overview ... 67
  4.1.1 Communication and Data Transfer ... 68
  4.1.2 PHY Sharing ... 68
4.2 MMIO Interface ... 69
4.3 PIO Communication ... 70
  4.3.1 Memoryless Operation ... 70
  4.3.2 Ordering ... 71
4.4 Push DMA ... 71
  4.4.1 Descriptors ... 71
  4.4.2 Request Partitioning ... 73
  4.4.3 Notification and Flow Control ... 73
    4.4.3.1 Descriptor Rings Slot Available Notification ... 73
    4.4.3.2 Transaction Complete Notification ... 73
    4.4.3.3 PCI System Notification ... 73
  4.4.4 Flush/Fence ... 73
4.5 Pull-DMA ... 74
  4.5.1 Pull DMA Notifications and Flow Control ... 75
  4.5.2 Descriptor Rings Slot Available Notification ... 75
  4.5.3 Transaction Complete Notification ... 75
  4.5.4 Request Tracker ... 75
4.6 Flush/Fence ... 75
4.7 Address Translation ... 75
  4.7.1 I/O MMU ... 76
4.8 Ingress Mapping Regions ... 77
  4.8.1 Tile Map Memory Regions ... 78
    4.8.1.1 MAP-MEM Interrupts ... 78
    4.8.1.2 Map-Region Ordering ... 79
  4.8.2 Scatter Queue Regions ... 80
  4.8.3 Boot and Rshim Regions ... 81
  4.8.4 Map Fence ... 81
4.9 Panic Mode ... 82
4.10 Connection to mPIPE ... 82
4.11 Deadlock ... 84

CHAPTER 5 PCIE MAC INTERFACE
5.1 Introduction ... 85
5.2 Register Spaces ... 86
  5.2.1 Type-0/1 and Virtual Function Configuration Space ... 87
5.3 Port Configuration ... 88
5.4 IO Address Mapping ... 88
  5.4.1 Boot and Diagnostics Access ... 88
5.5 Interrupts ... 88
5.6 Power Management ... 88
5.7 Link Down Handling ... 89
5.8 SERDES Configuration ... 89
5.9 Streaming Interface ... 89
  5.9.1 Packetization ... 90
  5.9.2 Interrupts ... 90
  5.9.3 Flow Control ... 90

CHAPTER 6 MPIPE ARCHITECTURE
6.1 Overview ... 91
  6.1.1 Glossary ... 91
  6.1.2 PHY and DMA Sharing ... 92
  6.1.3 Channelization ... 92
  6.1.4 Channels vs. Ports ... 93
  6.1.5 Priority Queues ... 93
  6.1.6 Communication Model ... 93
6.2 Ingress Services ... 93
  6.2.1 Typical Ingress Flow ... 93
  6.2.2 Buffers ... 94
    6.2.2.1 Buffer Stacks ... 95
    6.2.2.2 Buffer Chaining ... 96
    6.2.2.3 Buffer Release ... 98
    6.2.2.4 Buffer Stack Engine ... 99
  6.2.3 iDMA Packet Descriptors ... 100
  6.2.4 Notification Rings ... 102
  6.2.5 Store-and-Forward vs. Cut-Through ... 103
  6.2.6 Classifier ... 103
    6.2.6.1 Parallel Processing ... 104
    6.2.6.2 Cycle Budget ... 104
  6.2.7 Processor Architecture ... 105
    6.2.7.1 Header and Descriptor ... 106
    6.2.7.2 Table Lookup ... 106
    6.2.7.3 Special Registers ... 106
    6.2.7.4 Hash Accumulator ... 107
    6.2.7.5 Endianness ... 107
    6.2.7.6 Header/Descriptor Valid Indicators ... 107
    6.2.7.7 Classifier Pipeline ... 108
    6.2.7.8 Stalls ... 108
    6.2.7.9 Persistent State ... 109
    6.2.7.10 Exceptions ... 110
    6.2.7.11 Classifier Configuration ... 110
    6.2.7.12 Classifier “Blast” Re/Programming ... 110
    6.2.7.13 SPRs ... 112
    6.2.7.14 Classifier Tools ... 112
  6.2.8 iDMA Engine ... 112
    6.2.8.1 Temporal Hints for iDMA Writes ... 113
  6.2.9 Load Balancer ... 114
    6.2.9.1 BucketSTS ... 114
    6.2.9.2 Notification Groups ... 114
    6.2.9.3 Notification Ring Arbitration ... 115
    6.2.9.4 Load Balance Override Flows ... 117
  6.2.10 Checksum ... 118
  6.2.11 Notification ... 118
    6.2.11.1 Tail Pointer Updates – Polling Model ... 119
    6.2.11.2 Notification Interrupts ... 119
    6.2.11.3 Timestamp and Sequence Number Information ... 120
  6.2.12 Counters ... 120
  6.2.13 Software Override Flows ... 120
    6.2.13.1 Software Classification ... 120
    6.2.13.2 Software Load Balancing ... 121
    6.2.13.3 Software Buffer Management ... 121
6.3 Ingress Channel Flow Control ... 121
6.4 Packet Drops ... 122
  6.4.1 Drop/Truncate: iPkt Full ... 122
  6.4.2 Drop: Classifier Cycle-Budget ... 122
  6.4.3 Drop: Classifier Program ... 122
  6.4.4 Drop: NotifRing Full ... 122
  6.4.5 Drop: Bucket Count Full ... 122
  6.4.6 Drop/Truncate: Out of Buffers ...
123 6.5 Egress Services ..................................................................................................................................123 6.5.1 Typical Egress Flow ................................................................................................................................ 123 6.5.2 eDMA Packet Descriptors ...................................................................................................................... 124 6.5.2.1 eDMA Descriptor Fetch .............................................................................................................. 125 6.5.2.2 eDMA Descriptor Hunt Mode ................................................................................................... 125 6.5.2.3 Explicit eDMA Descriptor Post .................................................................................................. 126 6.5.2.4 eDMA Descriptor Ring Reordering .......................................................................................... 126 6.5.2.5 Descriptor Prefetch and Memory Ordering ............................................................................. 127 6.5.2.6 Descriptor-Write and Descriptor-Post Ordering .................................................................... 127 6.5.2.7 Ring to Channel Mapping .......................................................................................................... 127 6.5.2.8 Descriptor Errors .......................................................................................................................... 128 6.5.3 Buffers ....................................................................................................................................................... 128 6.5.3.1 Chaining ........................................................................................................................................ 
128 6.5.3.2 Descriptor-Based Gather ............................................................................................................. 128 6.5.3.3 Transaction Sizing and Buffer Offsets ...................................................................................... 129 6.5.3.4 Buffer Release ............................................................................................................................... 129 6.5.3.5 Egress VA Translations ............................................................................................................... 130 6.5.4 eDMA Engine ........................................................................................................................................... 130 6.5.5 ePkt Buffering .......................................................................................................................................... 130 6.5.6 Notifications ............................................................................................................................................. 130 6.5.6.1 Descriptor Ring Head .................................................................................................................. 130 6.5.6.2 Descriptor Complete Interrupt and Counter ........................................................................... 131 6.5.7 Checksum ................................................................................................................................................. 131 6.5.7.1 eDMA Checksum Buffer Limitations ....................................................................................... 131 6.5.8 Egress Picker ............................................................................................................................................ 132 6.5.8.1 Egress Priority Arbitration ......................................................................................................... 
132 6.5.8.2 Egress Priority Flow Control ...................................................................................................... 132 6.5.9 Special Flows ............................................................................................................................................ 133 6.5.9.1 NoSend Option ............................................................................................................................ 133 6.5.9.2 Size=0 Option ............................................................................................................................... 133 6.5.9.3 eDMA Loopback .......................................................................................................................... 133 6.6 Virtual Memory .................................................................................................................................133 6.6.1 I/O TLB Details ....................................................................................................................................... 134 6.7 PA Distribution .................................................................................................................................135 viii Tile Processor and I/O Device Guide for the TILE-Gx Family of Processors Contents 6.7.1 Locality Hints ........................................................................................................................................... 135 6.7.2 Pinning ...................................................................................................................................................... 136 6.8 MMIO ................................................................................................................................................. 136 6.8.1 MAC Configuration Registers ............................................................................................................... 
136 6.8.2 Service Domains ...................................................................................................................................... 137 6.9 Interrupts ........................................................................................................................................... 138 6.10 UserIO .............................................................................................................................................. 139 6.11 Flush Mechanisms ......................................................................................................................... 139 6.11.1 MMIO Access Drain .............................................................................................................................. 139 6.11.2 NotifRing Drain ..................................................................................................................................... 139 6.11.3 Ingress Channel Drain .......................................................................................................................... 140 6.11.4 EDMA Ring Drain ................................................................................................................................. 140 CHAPTER 7 XAUI MAC INTERFACE 7.1 Introduction ....................................................................................................................................... 143 7.1.1 Features ..................................................................................................................................................... 143 7.2 Register Spaces ................................................................................................................................. 143 7.3 MAC and Channel Mapping ......................................................................................................... 
144 7.4 Port Configuration ........................................................................................................................... 144 7.4.1 Lane Sharing with SGMII ....................................................................................................................... 145 7.5 Flow Control ...................................................................................................................................... 145 7.5.1 Priority-Based Flow Control .................................................................................................................. 145 7.6 Interrupts ........................................................................................................................................... 145 7.7 Timestamping and IEEE 1588 ........................................................................................................ 145 7.8 MDIO ................................................................................................................................................. 151 7.9 Statistics ............................................................................................................................................. 151 7.10 Filtering ............................................................................................................................................ 151 7.10.1 Type ID Checking ................................................................................................................................. 152 7.10.2 Broadcast Address ................................................................................................................................ 152 7.10.3 Hash Addressing ................................................................................................................................... 
153 7.11 Special Modes ................................................................................................................................. 153 7.11.1 Pass All Frames Mode .......................................................................................................................... 153 7.11.2 Custom Preamble .................................................................................................................................. 153 7.11.3 Short IPG ................................................................................................................................................ 153 7.12 SERDES Control ............................................................................................................................. 154 7.13 LEDs .................................................................................................................................................. 154 CHAPTER 8 SGMII MAC INTERFACE 8.1 Introduction ....................................................................................................................................... 155 8.1.1 Features ..................................................................................................................................................... 155 8.2 Register Spaces ................................................................................................................................. 155 8.3 MAC and Channel Mapping ......................................................................................................... 156 8.4 Port Configuration ........................................................................................................................... 156 8.4.1 Lane Sharing with XAUI ........................................................................................................................ 
156 8.5 Flow Control ...................................................................................................................................... 157 8.5.1 Priority-Based Flow Control .................................................................................................................. 157 Tile Processor and I/O Device Guide for the TILE-Gx Family of Processors ix Contents 8.6 Interrupts ............................................................................................................................................157 8.7 Timestamping and IEEE 1588 .........................................................................................................157 8.8 MDIO ..................................................................................................................................................158 8.9 10/100Mbps Support .........................................................................................................................158 8.10 Half-Duplex Support .....................................................................................................................158 8.11 Energy Efficient Ethernet Support (IEEE 802.3az) ....................................................................158 8.11.1 802.3az Operation .................................................................................................................................. 158 8.11.2 LPI Operation in the MAC ................................................................................................................... 159 8.12 PCS Auto-Negotiation ...................................................................................................................159 8.12.1 PCS Collision Detect and Carrier Sense ............................................................................................. 
160 8.12.2 Link Status .............................................................................................................................................. 160 8.13 Statistics ............................................................................................................................................160 8.14 Filtering ............................................................................................................................................160 8.14.1 Type ID Checking .................................................................................................................................. 161 8.14.2 Broadcast Address ................................................................................................................................. 162 8.14.3 Hash Addressing ................................................................................................................................... 162 CHAPTER 9 TILE-GX INTERLAKEN INTERFACE 9.1 Overview ............................................................................................................................................163 9.1.1 Channel Mapping .................................................................................................................................... 163 9.2 TX Interface ........................................................................................................................................163 9.2.1 Burst Scheduler ........................................................................................................................................ 164 9.2.2 Packet vs. Burst ........................................................................................................................................ 
164 9.3 RX Interface .......................................................................................................................................164 9.4 Flow Control ......................................................................................................................................164 9.4.1 Link Level TX Flow Control .................................................................................................................. 165 9.4.2 Channel-Based Flow Control ................................................................................................................. 165 9.4.3 Link Level RX Flow Control .................................................................................................................. 165 9.4.4 Out-of-Band Flow Control ..................................................................................................................... 165 9.5 Statistics ..............................................................................................................................................165 9.6 Initialization ......................................................................................................................................166 9.7 Error Handling ..................................................................................................................................166 CHAPTER 10 USB INTERFACE 10.1 Overview ..........................................................................................................................................167 10.2 External I/O Interface .....................................................................................................................168 10.3 Mesh Interface .................................................................................................................................168 10.3.1 MMIO Interface 
..................................................................................................................................... 168 10.3.2 Memory Access ...................................................................................................................................... 169 10.3.3 Interrupt Interface ................................................................................................................................. 170 10.4 Host Controller ................................................................................................................................170 10.5 Device Endpoint .............................................................................................................................170 10.5.1 Configuration ......................................................................................................................................... 170 10.5.2 MAC Design ........................................................................................................................................... 170 10.5.3 MAC Interrupts ..................................................................................................................................... 171 10.5.3.1 Device Interrupts ....................................................................................................................... 171 x Tile Processor and I/O Device Guide for the TILE-Gx Family of Processors Contents 10.5.3.2 Endpoint Interrupts ................................................................................................................... 171 10.6 Standalone Device Operation ...................................................................................................... 171 10.6.1 Interface and Endpoint Configuration ............................................................................................... 
172 10.6.2 Boot/Debug Interface ........................................................................................................................... 172 10.6.3 Tile-Monitor Interface ........................................................................................................................... 173 CHAPTER 11 COMMON ACCELERATOR INTERFACE (MICA) 11.1 Introduction ..................................................................................................................................... 175 11.2 Overview and Major Functional Blocks .................................................................................... 176 11.2.1 Major Blocks ........................................................................................................................................... 177 11.2.1.1 Mesh Interface ............................................................................................................................ 177 11.2.1.2 MMIO Registers and Context State ......................................................................................... 177 11.2.1.3 Operand Data Specification ..................................................................................................... 181 11.2.1.4 TLB (Translation Lookaside Buffer) ........................................................................................ 183 11.2.1.5 Engine Scheduler ....................................................................................................................... 184 11.2.1.6 Function Specific Engines ......................................................................................................... 184 11.2.1.7 DMA Channels .......................................................................................................................... 184 11.2.1.8 PA to Header Generation ......................................................................................................... 
184 11.3 Operation Flow ............................................................................................................................... 184 11.3.1 General Flow .......................................................................................................................................... 184 11.3.2 Tile Interrupts ........................................................................................................................................ 185 11.3.3 Specific Use Examples .......................................................................................................................... 186 11.3.3.1 General Use ................................................................................................................................ 186 11.3.3.2 TLB Miss ..................................................................................................................................... 186 11.3.3.3 Deferred Interrupts ................................................................................................................... 187 11.3.3.4 Pause Context ............................................................................................................................. 187 11.3.3.5 TLB Probe ................................................................................................................................... 188 11.3.3.6 TLB Shootdown ......................................................................................................................... 188 11.3.3.7 Terminate Operation for a Specific Context .......................................................................... 188 CHAPTER 12 CRYPTOGRAPHIC ACCELERATOR INTERFACE 12.1 Engines ............................................................................................................................................. 
191 12.2 Schedulers ....................................................................................................................................... 191 12.3 Contexts ............................................................................................................................................ 191 12.4 Engine-Specific Details ................................................................................................................. 191 12.4.1 Memory-to-Memory Copy Engine ..................................................................................................... 191 12.4.1.1 Usage Constraints for the Engine ............................................................................................ 191 12.4.2 Crypto Packet Processor ....................................................................................................................... 192 12.4.2.1 Usage Constraints for the Crypto Packet Processor Engine ............................................... 192 12.4.3 KASUMI and SNOW-3G Engine ........................................................................................................ 193 12.4.3.1 KASUMI Engine ........................................................................................................................ 193 12.4.3.2 SNOW-3G Engine ...................................................................................................................... 194 12.4.3.3 Usage Constraints for the Engine ............................................................................................ 195 12.4.4 Public Key Accelerator Engine ............................................................................................................ 195 12.4.4.1 Descriptor Ring Management .................................................................................................. 
196 Tile Processor and I/O Device Guide for the TILE-Gx Family of Processors xi Contents 12.4.4.2 Command Descriptor Contents ............................................................................................... 197 12.4.4.3 Result Descriptor Contents ....................................................................................................... 197 12.4.4.4 Interrupts .................................................................................................................................... 197 CHAPTER 13 COMPRESSION ACCELERATOR INTERFACE 13.1 Overview ..........................................................................................................................................199 13.2 Data Flows ........................................................................................................................................199 13.2.1 Typical Compression Flow .................................................................................................................. 199 13.2.2 Typical Decompression Flow .............................................................................................................. 200 13.3 Compression Engine ......................................................................................................................200 13.3.1 Engine Configuration ........................................................................................................................... 200 13.3.2 GZIP Handling ...................................................................................................................................... 201 13.4 Decompression Engine ..................................................................................................................201 13.4.1 GZIP Handling ...................................................................................................................................... 
201 13.5 Memory-to-Memory Copy ............................................................................................................201 13.6 API .....................................................................................................................................................201 13.6.1 Context Registers ................................................................................................................................... 202 13.6.2 Compression/Decompression Engine Registers .............................................................................. 202 13.6.3 Status Registers ...................................................................................................................................... 203 13.6.4 Transaction Size ..................................................................................................................................... 203 13.6.5 Data Expansion Handling .................................................................................................................... 203 13.6.6 Performance Counter ............................................................................................................................ 
204

CHAPTER 14 FLEXIBLE I/O INTERFACE
14.1 Overview ..... 205
14.2 Virtualization and Protection Support ..... 205
14.3 MMIO Register Map ..... 206
14.4 Interrupts ..... 206
14.5 I/O Pin Driver Configuration ..... 206
14.6 I/O Pin Clocking Control ..... 206
14.7 Pin Control and Data Accesses ..... 207
14.8 Reset/Initialization ..... 207
14.9 Performance ..... 207

CHAPTER 15 RSHIM INTERFACES
15.1 Level-1 Boot ..... 209
15.2 I/O Discovery ..... 209
15.3 tile-monitor FIFOs ..... 209
15.4 Down-Counters and Watchdog ..... 210
15.5 Rshim JTAG ..... 210
15.6 Reset Control ..... 210
15.7 Byte Access Interface ..... 210
15.8 Remote Interface Access and Device Protection ..... 210

CHAPTER 16 UART INTERFACES
16.1 UART Interface ..... 211
16.1.1 Overview ..... 211
16.1.1.1 Protocol Mode ..... 211
16.1.2 Data Flows ..... 213
16.1.2.1 Receiving Data ..... 213
16.1.2.2 Transmitting Data ..... 213
16.1.3 Flow Control ..... 213
16.1.4 Master Arbitration ..... 213
16.1.5 8/64 Bits Handling ..... 214
16.1.5.1 Remote UART Writes ..... 214
16.1.5.2 Remote UART Reads ..... 214
16.1.6 Error Handling and Interrupts ..... 214
16.1.7 UART Controller Registers ..... 214

CHAPTER 17 I2C MASTER INTERFACE
17.1 Overview ..... 215
17.1.1 I2C Master Boot Options ..... 215
17.1.2 Boot ROM Format ..... 216
17.1.3 Boot Operations ..... 217
17.2 Usage Model ..... 217
17.2.1 Generic Operation ..... 217
17.2.2 Software Instructions ..... 220
17.2.3 I2C EEPROM Page Mode ..... 222
17.2.4 Error Handling and Interrupts ..... 222
17.3 Registers ..... 222

CHAPTER 18 I2C SLAVE INTERFACE
18.1 Overview ..... 223
18.2 Usage Model ..... 223
18.2.1 Data Flows ..... 223
18.2.2 Direct-Addressing ..... 224
18.2.3 No-Address Access ..... 224
18.2.4 8 Bits / 64 Bits Handling ..... 224
18.2.5 Acknowledge Control ..... 225
18.2.6 Access Arbitration ..... 225
18.2.7 Error Handling and Interrupts ..... 225

CHAPTER 19 SPI INTERFACE
19.1 Overview ..... 227
19.1.1 Boot Options ..... 227
19.1.2 Boot ROM Format ..... 227
19.2 Usage Model ..... 229
19.2.1 Boot Operation ..... 229
19.2.2 SPI Flash Operations ..... 229
19.2.2.1 SPI Flash Instructions ..... 229
19.2.2.2 SPI Configurable Instruction Sets ..... 230
19.2.2.3 SPI Flash Unknown Instruction ..... 230
19.2.2.4 SPI Flash Deep Power-Down ..... 230
19.2.2.5 SPI Flash Write In-Progress ..... 230
19.2.2.6 SPI Flash Write Protection ..... 230
19.2.2.7 SPI Flash Page Mode ..... 231
19.2.2.8 SPI Flash Interface ..... 231
19.2.2.9 Software Command Sequences to Execute an SPI Flash Instruction ..... 231
19.2.2.10 Interface Timing ..... 233
19.2.3 Rshim Interface ..... 234
19.2.3.1 Rshim Register Interface ..... 234
19.2.3.2 Rshim Host Interface ..... 234
19.2.3.3 Error Handling and Interrupts ..... 234

APPENDIX A JTAG INTERFACE

APPENDIX B CLASSIFIER INSTRUCTIONS AND SPRS
B.1 Classifier Instructions ..... 237
B.1.1 Arithmetic Instructions ..... 238
B.1.2 Comparison Instructions ..... 240
B.1.3 Control Instructions ..... 242
B.1.4 Logical Instructions ..... 243
B.1.5 Miscellaneous Instructions ..... 245
B.2 Registers ..... 246
B.2.1 Register Summary ..... 246
B.2.2 Register Definitions ..... 247

APPENDIX C MISCELLANEOUS ACCELERATOR SPECIFICATIONS
C.1 SNOW-3G Engines ..... 255
C.1.1 Specification Summary ..... 255
C.1.2 Performance ..... 256
C.1.2.1 Introduction ..... 256
C.1.3 Functional Description ..... 257
C.1.3.1 SNOW Key Stream Generator ..... 257
C.1.4 Feedback Logic and XOR ..... 258
C.1.5 Examples ..... 259
C.1.6 Operations ..... 260
C.1.6.1 General Operations ..... 260
C.1.6.2 Encryption Modes: UEA2 / 128-EEA1 ..... 260
C.2 KASUMI Engines ..... 261
C.2.1 Introduction ..... 261
C.2.1.1 Specification Summary ..... 261
C.2.2 KASUMI Engine Functional Description ..... 261
C.2.2.1 General Processing ..... 261
C.2.2.2 Examples ..... 262
C.3 Packet Processor — Programming ..... 264
C.3.1 Introduction ..... 264
C.3.1.1 Purpose ..... 264
C.3.1.2 Scope ..... 264
C.3.1.3 Abbreviation and Definitions ..... 264
C.3.1.4 Data Flow Table ..... 264
C.3.2 ARC4 Algorithm ..... 265
C.3.3 AES-CCM for Basic Operations and IPSec Protocols ..... 266
C.3.3.1 Introduction ..... 266
C.3.3.2 Authentication ..... 267
C.3.3.3 Encryption ..... 268
C.3.3.4 Implementation ..... 269
C.3.3.5 Basic Operation ..... 270
C.3.3.6 ESP ..... 276
C.3.4 AES-GMAC/AES-GCM for Basic Operations and IPSec Protocols ..... 279
C.3.4.1 Introduction ..... 279
C.3.4.2 Basic Operation ..... 279
C.3.4.3 IPSec ..... 281
C.3.5 SRTP/SRTCP Protocols ..... 289
C.3.5.1 Introduction ..... 289
C.3.5.2 Packet Format ..... 289
C.4 Context Control Words ..... 290
C.4.0.1 Outbound Processing ..... 292
C.4.0.2 Inbound Processing ..... 293
C.4.1 MACsec Protocol ..... 295
C.4.1.1 Introduction ..... 295
C.4.1.2 Packet Format ..... 296
C.4.1.3 Context Control Words ..... 297
C.4.1.4 Outbound Processing ..... 299
C.4.1.5 Inbound Processing ..... 301
C.4.2 DTLS Protocol ..... 303
C.4.2.1 Introduction ..... 303
C.4.2.2 Supported Features ..... 303
C.4.2.3 Packet Format ..... 304
C.4.2.4 Context Control Words ..... 304
C.4.2.5 Outbound Processing ..... 306
C.4.2.6 Inbound Processing ..... 309
C.4.3 SSL/TLS Protocol ..... 312
C.4.3.1 Introduction ..... 312
C.4.3.2 Supported Features ..... 313
C.4.3.3 Packet Format ..... 313
C.4.3.4 Context Control Words ..... 314
C.4.3.5 SSL MAC ..... 315
C.4.3.6 Outbound Processing ..... 316
C.4.3.7 Inbound Processing ..... 319
C.5 Public Key Accelerator (PKA) ..... 322
C.5.1 PKA Firmware Architecture Overview ..... 322
C.5.2 Command and Vector Copy and Zeroization ..... 324
C.5.3 PKI Command Interface ..... 326
C.5.4 Main PKI Command Interface ..... 326
C.5.4.1 Descriptor Ring Management ..... 326
C.5.4.2 Descriptor Ring Control/Status Words ..... 326
C.5.5 PKI Command and Result Descriptors ..... 332
C.5.5.1 Command Descriptor Contents ..... 332
C.5.5.2 Result Descriptor Contents ..... 334
C.5.5.3 PKI Command/Result Specifics (Firmware Dependent) ..... 337
C.5.5.4 Restrictions on PKA Operations ..... 353
C.5.6 PKI Key Decrypt Key Management Interface ..... 359
C.5.6.1 AES Byte Order Example ..... 361
C.5.6.2 PKI Key Decrypt Keys Storage (PKI_KDK_0_[0:7] … _3_[0:7]) ..... 362
C.5.6.3 PKI Key Decrypt IVs Storage (PKI_KD_IV_0_[0:3] … _3_[0:3]) ..... 364
C.5.6.4 PKI Key Decrypt CTR Mode Increment Storage (PKI_KD_INCR_0 … _3) ..... 366
C.5.6.5 PKI Key Decrypt Key Control Words ..... 368
C.5.7 PKI Engine Boot-Up and Internal Error Reporting ..... 369
C.6 Conventions Used in this Manual ..... 370
C.6.1 Register Information ..... 370

APPENDIX D INLINE PACKET ENGINE
D.1 Crypto Packet Processor Processing Overview ..... 371
D.1.1 Crypto Packet Processor Terms ..... 371
D.1.1.1 Tokens ..... 372
D.1.1.2 Context ..... 372
D.2 Configuring the Crypto Packet Processor ..... 372
D.2.1 Enabling Protocol and Algorithm Support ..... 372
D.2.2 Context Fetch Modes ..... 372
D.2.3 Packet Processing Modes ..... 373
D.3 Pseudo Random Number Generator ..... 375
D.3.1 Purpose ..... 375
D.3.2 Architecture ..... 376
D.3.3 Functional Description ..... 376
D.3.4 Generation of DT ..... 377
D.3.5 Generation of Keys ..... 378
D.3.6 Performance ..... 378
D.4 Input Token Definition ..... 379
D.4.1 Introduction ..... 379
D.4.2 Input Token Diagram ..... 379
D.4.2.1 Input Token Header ..... 379
D.5 Processing Instructions ..... 392
D.5.1 Instruction Types ..... 392
D.5.1.1 Operational Data Instructions (Type 1) ..... 392
D.5.1.2 IP Header Instructions (Type 2) ..... 392
D.5.1.3 Post-Process Instructions (Type 3) ..... 393
D.5.1.4 Result Instructions (Type 4) ..... 393
D.5.1.5 Context Control Instructions (Type 5) ..... 393
D.5.1.6 Special Instructions (Type 6) ..... 393
D.5.2 Instruction Sequencing ..... 393
D.5.2.1 Sequencing Rules ..... 393
D.5.3 Instruction Format ..... 394
D.5.4 Operational Data Instructions ..... 396
D.5.4.1 Direction Instruction ..... 396
D.5.4.2 PRE_CHECKSUM Instruction ..... 397
D.5.4.3 INSERT Instruction ..... 398
D.5.4.4 INSERT Instruction Example – NOP ..... 401
D.5.4.5 INSERT_CTX Instruction ..... 402
D.5.4.6 REPLACE Instruction ..... 403
D.5.4.7 RETRIEVE Instruction ..... 403
D.5.4.8 MUTE Instruction ..... 404
D.5.5 IP Header Instructions ..... 405
D.5.5.1 IPv4 Instruction ..... 406
D.5.5.2 IPv4_CHECKSUM Instruction ..... 407
D.5.5.3 IPv6 Instruction ..... 407
D.5.6 Post-Process Instructions ..... 408
D.5.6.1 INSERT_REMOVE_RESULT (IRR) Instruction ..... 409
D.5.6.2 REPLACE_BYTE Instruction ..... 423
D.5.6.3 Reserved Instructions ..... 424
D.5.7 Result Instructions ..... 424
D.5.7.1 VERIFY_FIELDS Instruction ..... 424
D.5.8 Context Control Instructions ..... 425
D.5.8.1 CONTEXT_ACCESS Instruction ..... 425
D.5.9 Bypass Token Data – Special Instruction ..... 429
D.6 Result Token Definition ..... 429
D.7 Pre and Post-Processing by Host Software ..... 432
D.7.1 Preprocessing ..... 432
D.7.2 Post-Processing ..... 433
D.7.2.1 Result Token ..... 433
D.7.2.2 Appended Data ..... 433
D.7.2.3 Suggested Post-Processing Operations ..... 433
D.8 Context Record Definition ..... 433
D.8.1 Context Record Format ..... 433
D.8.2 Context Control Words ..... 436
D.8.2.1 Control Word 0 Field Encoding ..... 436
D.8.2.2 Context Control Word 1 Definition ..... 439
D.8.2.3 Key ..... 443
D.8.2.4 Hash Digest ..... 444
D.8.2.5 Security Parameter Index ..... 448
D.8.2.6 Sequence Number Processing ..... 449
D.8.2.7 IV Data ..... 452
D.8.3 Examples for Most Common Scenarios ..... 453
D.8.3.1 Basic Encryption Operation ..... 453
D.8.3.2 Basic Hash Operation ..... 454
D.8.3.3 Combined Basic Encrypt Hash Operations ..... 454
D.8.3.4 IPSec Hash Only Operation ..... 454
D.8.3.5 IPSec Encryption Only Operation ..... 454
D.8.3.6 IPSec Combined Encryption Hash Operation ..... 454
D.8.3.7 Typical Initialization Values for IPSec Operations ..... 455
D.9 Register and Memory Map ..... 456
D.9.1 Configuration Registers ..... 463
D.9.1.1 Packet Engine Token Control ..... 464
D.9.1.2 Packet Engine Context Control ..... 466
D.9.1.3 Packet Engine Interrupts ..... 469
D.9.1.4 Packet Engine Data Fetch Control ..... 470
D.9.1.5 Crypto Packet Processor Input and Output Transfer Control/Status Register ..... 471
D.9.1.6 Packet Engine Configuration ..... 473
D.9.2 PRNG Registers ..... 474
D.9.2.1 PRNG Seed Register (PRNG_SEED_L, PRNG_SEED_H) ..... 476
D.9.2.2 PRNG DES Key Registers (PRNG_KEY0_L, PRNG_KEY0_H, PRNG_KEY1_L, PRNG_KEY1_H) ..... 476
D.9.2.3 PRNG Output Registers (PRNG_RES0_L, PRNG_RES0_H, PRNG_RES1_L, PRNG_RES1_H) ..... 478
D.9.2.4 PRNG LFSR Registers (PRNG_LFSR_L, PRNG_LFSR_H) ..... 480
D.10 Protocol Compliancy ..... 481
D.10.1 Introduction ..... 481
D.10.2 Disclaimer .....
481 D.10.3 IP Header ............................................................................................................................................... 481 D.10.4 ESP Processing ...................................................................................................................................... 482 D.10.5 AH Processing ...................................................................................................................................... 483 D.10.6 SSL Processing ...................................................................................................................................... 485 D.10.7 TLS Processing ...................................................................................................................................... 486 D.10.8 DTLS Processing ................................................................................................................................... 487 D.10.9 SRTP/SRTCP Processing .................................................................................................................... 488 D.10.10 MACsec Processing ............................................................................................................................ 488 APPENDIX E INLINE PACKET ENGINE — TOKEN EXAMPLES E.1 Introduction .......................................................................................................................................491 E.1.1 Purpose ..................................................................................................................................................... 491 E.2 Token Examples — Basic Operations ..........................................................................................491 E.2.1 Bypass Packet Token (IPv4) ................................................................................................................... 
492 E.2.2 Bypass Packet Token (IPv6) ................................................................................................................... 493 E.2.3 ESP Outbound Packet Token (IPv4, Transport Mode) ...................................................................... 494 xviii Tile Processor and I/O Device Guide for the TILE-Gx Family of Processors Contents E.2.3.1 With CBC Mode .......................................................................................................................... 494 E.2.3.2 With CTR Mode ........................................................................................................................... 495 E.2.3.3 With GCM .................................................................................................................................... 496 E.2.4 ESP Inbound Packet Token (IPv4, Transport Mode) ......................................................................... 498 E.2.4.1 With CBC Mode .......................................................................................................................... 498 E.2.4.2 With CTR Mode ........................................................................................................................... 500 E.2.4.3 With GCM .................................................................................................................................... 502 E.2.5 ESP Inbound Packet Token (IPv4, Transport Mode, Jumbo) ........................................................... 504 E.2.5.1 With CBC Mode .......................................................................................................................... 504 E.2.5.2 With GCM .................................................................................................................................... 506 E.2.6 ESP Outbound Packet Token (IPv4, Tunnel Mode) ........................................................................... 
508 E.2.7 ESP Inbound Packet Token (IPv4, Tunnel Mode) .............................................................................. 509 E.2.8 ESP Outbound Packet Token (IPv6, Transport Mode) ...................................................................... 511 E.2.9 ESP Inbound Packet Token (IPv6, Transport Mode) ......................................................................... 513 E.2.10 AH Outbound Packet Token (IPv4, Transport Mode) .................................................................... 515 E.2.11 AH Inbound Packet Token (IPv4, Transport Mode) ....................................................................... 517 E.2.12 AH Outbound Packet Token (IPv4, Tunnel Mode) ......................................................................... 519 E.2.13 AH Outbound Packet Token (IPv4, Tunnel Mode, Jumbo) ........................................................... 521 E.2.14 AH Inbound Packet Token (IPv4, Tunnel Mode) ............................................................................ 524 E.2.15 AH Inbound Packet Token (IPv4, Tunnel Mode) Using Mute Instruction .................................. 526 E.2.16 AH Outbound Packet Token with Routing Extension Header (IPv6, Transport Mode) ........... 527 E.2.17 AH Outbound Packet Token with Multiple Extension Headers (IPv6, Transport Mode) ........ 529 E.2.18 AH Inbound Packet Token (IPv6, Transport Mode) ....................................................................... 532 E.2.19 AH Outbound Packet Token (IPv6, Tunnel Mode) ......................................................................... 535 E.2.20 AH Inbound Packet Token (IPv6, Tunnel Mode) ............................................................................ 537 E.2.21 sRTP Outbound — Packet Token (IPv4 — UDP — RTP) ............................................................... 539 E.2.22 sRTP Inbound — Packet Token (IPv4 — UDP — RTP) .................................................................. 
541 E.2.23 Simple Token Examples ....................................................................................................................... 543 E.3 Token Examples - Advanced Operations .................................................................................... 544 E.3.1 Basic Processing ...................................................................................................................................... 544 E.3.1.1 Outbound ARC4 .......................................................................................................................... 544 E.3.2 ESP ............................................................................................................................................................ 545 E.3.2.1 ESP Outbound Packet Token (IPv4, Transport Mode, AES-CCM) ...................................... 545 E.3.2.2 ESP Inbound Packet Token (IPv4, Transport Mode, AES-CCM) ......................................... 547 E.3.2.3 ESP Outbound Packet Token (IPv4, Transport Mode, AES-GMAC) .................................. 549 E.3.2.4 ESP Inbound Packet Token (IPv4, Transport mode, AES-GMAC) ...................................... 551 E.3.2.5 ESP Outbound Packet Token (IPv4, Transport Mode with Encryption and SHA-2 Authentication) ............................................................................................................................................................... 553 E.3.2.6 ESP Inbound Packet Token (IPv4, Transport Mode with Encryption and SHA-2 Authentication) ................................................................................................................................................................... 555 E.3.3 AH ............................................................................................................................................................. 
557 E.3.3.1 AH Outbound Packet Token (IPv4, Transport Mode, AES-GMAC) ................................... 557 E.3.3.2 AH Inbound Packet Token (IPv4, Transport Mode, AES-GMAC) ...................................... 559 E.3.4 SSL ............................................................................................................................................................. 561 E.3.4.1 Introduction ................................................................................................................................. 561 Tile Processor and I/O Device Guide for the TILE-Gx Family of Processors xix Contents E.3.4.2 SSL Outbound Packet Token (ARC4) ....................................................................................... 561 E.3.4.3 SSL Inbound Packet Token (ARC4) .......................................................................................... 563 E.3.5 DTLS ......................................................................................................................................................... 565 E.3.5.1 Introduction ................................................................................................................................. 565 E.3.5.2 DTLS Outbound Packet Token (AES-CBC) ............................................................................. 565 E.3.5.3 DTLS Inbound Packet Token (AES-CBC) ................................................................................ 567 E.3.6 MACsec .................................................................................................................................................... 569 E.3.6.1 MACsec Outbound Packet Token (AES-GCM) ...................................................................... 569 E.3.6.2 MACsec Inbound Packet Token (AES-GCM) ......................................................................... 
571 E.3.7 SRTCP ....................................................................................................................................................... 573 E.3.7.1 SRTCP Outbound Packet Token (AES-ICM) ........................................................................... 573 E.3.7.2 SRTCP Inbound Packet Token (AES-ICM) .............................................................................. 575 APPENDIX F REFERENCES F.1 KASUMI References ........................................................................................................................577 F.2 Packet Processor References ...........................................................................................................577 F.3 SNOW-3G References .....................................................................................................................578 F.4 Public Key Accelerator References ...............................................................................................578 F.4.1 Open Specifications and Standards ...................................................................................................... 578 F.5 Inline Packet Engine (Token Example) References ...................................................................582 F.6 SGMII MAC Interface .....................................................................................................................584 GLOSSARY, CONVENTIONS AND STANDARDS ..................................................................... 585 G.1 Conventions and Standards ..........................................................................................................585 G.2 Glossary .............................................................................................................................................591 INDEX ................................................................................................................................ 
597

PREFACE

About this Manual
This manual describes the interfaces supported in the TILE-Gx™ processor.

Intended Audience
This manual is intended for use by hardware engineers. For information about how to access I/O devices via software, refer to the MDE System Programmer’s Guide (UG509).

Manual Contents Description
Chapters are grouped in sections called parts, based on the interface type: memory, high speed interfaces, common interfaces, and so on. This manual is organized as follows:
• The Preface provides an overview of this manual, information about contacting customer support, and general instructions about how registers are described.
• Chapter 1: I/O Device Introduction. This chapter identifies the registers that are common to all supported interfaces.
• Chapter 2: Tile Processor. This chapter provides a detailed description of the Tile Processor’s system architecture. It describes memory, interrupts, communication within a processor via the software-visible dynamic networks, Special Purpose Registers (SPRs), and in-tile system devices such as counters and timers.
• Chapter 3: Double Data Rate SDRAM (DDR3) Interface. This chapter provides a detailed description of the memory controller, how the memory interface manages data flow, data ordering, performance features, how errors are handled, interrupts, and memory interface registers.
• Chapter 4: PCIe Controller Architecture (TRIO). This chapter describes how to integrate the Tile processors with a PCI system.
• Chapter 5: PCIe MAC Interface. This chapter describes booting, deadlock avoidance, power management, and PCIe registers.
• Chapter 6: mPIPE Architecture. This chapter describes line rate services for the packet interfaces.
• Chapter 7: XAUI MAC Interface. This chapter describes the XAUI MAC interface, flow control, interrupts, and registers.
• Chapter 8: SGMII MAC Interface.
This chapter describes the SGMII MAC interface, flow control, interrupts, and registers.
• Chapter 9: TILE-Gx Interlaken Interface. This chapter describes the TILE-Gx Interlaken port, which provides a channelized packet interface between mPIPE and one or more SERDES lanes.
Tilera Confidential — Subject to Change Without Notice
• Chapter 10: USB Interface. This chapter describes the USB interface.
• Chapter 11: Common Accelerator Interface (MiCA). This chapter describes the architecture of the TILE-Gx™ Multicore iMesh Coprocessing Accelerator (MiCA™). The MiCA provides a common front end (both software and hardware) to various I/O off-load or acceleration functions, for example Crypto or Compression.
• Chapter 12: Cryptographic Accelerator Interface. This chapter describes the TILE-Gx Crypto implementation of the Multicore iMesh Coprocessing Accelerator (MiCA™).
• Chapter 13: Compression Accelerator Interface. This chapter describes those features unique to the compression functionality (MiCA).
• Chapter 14: Flexible I/O Interface. This chapter provides an overview of the controller architecture used to manage the flexible I/O pins.
• Chapter 15: Rshim Interfaces. This chapter describes the Rshim, which contains chip-level services for booting and debugging. It also hosts a number of the low speed interfaces, including UARTs, I2C-Masters, I2C-Slave, and serial peripheral interface (SPI). These interfaces are described in Chapters 16 through 19.
• Chapter 16: UART Interfaces. This chapter describes the UART device interface. It describes boot options, flow control, error handling and associated interrupts, and registers.
• Chapter 17: I2C Master Interface. This chapter describes the I2C Master Interface, which provides an interface for tiles to write to and read from external I2C devices.
• Chapter 18: I2C Slave Interface.
This chapter describes the I2C Slave Interface, which is the interface to an external I2C device.
• Chapter 19: SPI Interface. This chapter describes the SPI SROM interface, which provides an interface for tiles to write and read an off-chip SPI SROM.
• Appendix A: JTAG Interface. This appendix describes the JTAG interface, which provides an instruction register for reading and writing configuration registers within the Rshim.
• Appendix B: Classifier Instructions and SPRs. This appendix provides additional information about the classifier instructions and special purpose registers referenced in Chapter 6: mPIPE Architecture.
• Appendix C: Miscellaneous Accelerator Specifications. This appendix provides additional information about the four types of accelerators included with the TILE-Gx™ family of processors.
• Appendix D: Inline Packet Engine. This appendix provides a description of the crypto packet processor, information on how to configure it, and descriptions of the Pseudo Random Number Generator and input tokens.
• Appendix E: Inline Packet Engine — Token Examples. This appendix describes the format and use of sample input tokens.
• Appendix F: References. This appendix lists source materials and additional publications.
• Glossary, Conventions and Standards.
• Index.

Related Documents
Additional documentation from Tilera Corporation is available, including the following:
• Multicore Development Environment Release Notes, UG208
• TILE-Gx36 Tile Processor™ Preliminary Data Sheet, DS400
• TILE-Gx Instruction Set Architecture Specification, UG401

Technical or Customer Support
You can reach Tilera Corporation Customer Support in the following ways:
• Visit the Tilera Web site at http://www.tilera.com/
• E-mail questions to support@tilera.com

Product Information
You can obtain product information from the Tilera Corporation Web site, from the product CDROM, or from the printed publications (manuals). Tilera Corporation is online at http://www.tilera.com/.

Notation Conventions
Text conventions used in this manual are as follows:

Table 1. Notation Conventions

Example: Close command (File menu)
Description: Titles in reference sections indicate the location of an item within the IDE’s menu system (for example, the Close command appears on the File menu).

Example: Write In Progress (WIP) bit
Description: Courier text indicates the names of:
• A bit or bitfield, for example the CHANNEL bit
• A special purpose register (SPR), for example the RSH_COORD SPR
• Code sample
• Application, for example the tile-monitor application
• Command, for example the link command
• State, for example the IDLE state

Example: RSHIM MMIO registers
Description: Blue and underlined text in Courier font indicates hypertext links to HTML files associated with the interface, or links to a specific web site.

Example: Chapter 7: XAUI MAC Interface
Description: Blue text in text font (Palatino) indicates a cross-reference hypertext link to text elsewhere in the manual. If you are reading a soft copy of this manual, you can click on the link to jump directly to the referenced section.

Example: Note: For correct operation, ...
Description: A Note provides supplementary information on a related topic. In the online version of this book, the word Note appears instead of this symbol.

Example: Caution: Incorrect device operation might result if ... Caution: Device damage might result if ...
Description: A Caution identifies conditions or inappropriate usage of the product that could lead to undesirable results or product damage. In the online version of this book, the word Caution appears instead of this symbol.

Conventions for Register Descriptions
Several notational conventions are utilized in this document. The following section describes these conventions.

Conventions for Processor Families
Registers for each interface are described with the following:
• A narrative that includes addressing information.
• Register diagrams.
• Register bit description tables, like the one shown in Figure 1.

[Figure 1. Sample Bit Description Table]

Byte and Bit Order
The Tile Processor Architecture is little-endian. When sets of bytes are described or displayed in this document, bytes with more significance are displayed to the left of bytes with lesser significance. More significant bytes are always numbered with a higher number than less significant bytes (LSBs). When data is stored in memory, bytes that are of greater significance are stored in higher numbered memory addresses than bytes of less significance. When sets of bits are described or displayed in this document, bits of higher significance are displayed to the left of bits with lower significance. For instance, if 32 bits are to be displayed and are numbered from 0 to 31, bit 31 is displayed to the left of bit 0.
When groups of bits are operated on as integers, the Tile Processor Architecture carries from the less significant bits to the more significant bits when adding. Bits numbered with a higher number have greater significance than bits with a lower number.

Reserved Fields
In computational architectures, register bit fields often go unused. In the Tile Processor Architecture, unused bits are considered reserved zero (reserved 0). When bits labeled as reserved 0 are read, they are not guaranteed to return zero. Bits denoted as reserved 0 must be written as zero. Bits that are ignored by the hardware are explicitly called out as being write-ignored.

Numbering
The default base used in this document is base ten, or decimal representation. Any use of a bare numeric is considered to be a decimal number. Hexadecimal numbering is also used widely in this document. When a numeric is to be interpreted as a hexadecimal (base sixteen) number, the prefix “0x” is prepended to the number. For example, the number 74 can also be expressed as 0x4A when written in hexadecimal.

When ranges of bits are numbered as a subset of a larger set of ordered bits, a bracket notation is used. The notation contains one or two numbers separated by a colon. If only one number is specified, the numbered bit position is the bit referenced. For example, if “bus” is a 32-bit bus that is numbered 31 to 0 and the text describes bit 5, bus[5] is the appropriate nomenclature to signify that bit. For bit ranges, the left number is the higher-ordered bit location and the right number is the lower-ordered bit location. Bit ranges are inclusive. This nomenclature is consistent with the default manner in which little-endian bit ranges are denoted.
For example, if word is a 32-bit word numbered 31 to 0 and the text describes the bits from bit 5 to bit 20, the appropriate manner to denote that is word[20:5].

Figure 2 shows an example of how bitfields are graphically presented in this document. Bits[31:21] are shown as reserved bits.

[Figure 2. Bitfield Example: a 32-bit word with bit positions 31, 21, 20, 5, 4, and 0 marked and fields labeled First Field, Second Field, and Third Field]

Figure 3 shows four bitfields that are logically represented along with a gap. The gap is not reserved, but is instead allocated by another function.

[Figure 3. Bitfield Example with Fields Allocated by Other Functions: fields labeled SrcBDest_Y2 - Dest, SrcA_Y2[0:0] - Src[0:0], SrcA_Y2[5:1] - Src[5:1], and Opcode_Y2 - 0x2]

CHAPTER 1 I/O DEVICE INTRODUCTION

1.1 Overview
The TILE-Gx™ family of processors contains numerous on-chip I/O devices to facilitate direct connections to DDR3 memory, Ethernet, PCI Express, USB, I2C, and other standard interfaces. This chapter provides a brief overview of the on-chip I/O devices. For additional information about the I/O devices, refer to individual chapters in the book. For detailed system programming information, refer to the Special Purpose Registers (SPRs) and the associated device API guides.

1.1.1 Tile-to-Device Communication
Tile processors communicate with I/O devices via loads and stores to MMIO (Memory Mapped IO) space. The page table entries installed in a Tile’s Translation Lookaside Buffer (TLB) contain a MEMORY_ATTRIBUTE field, which is set to MMIO for pages that are used for I/O device communication.
The X,Y fields in the page table entry indicate the location of the I/O device on the mesh, and the translated physical address is used by the I/O device to determine the service or register being accessed. Since each I/O TLB entry contains the X,Y coordinate of the I/O device being accessed, each device effectively has its own 40-bit physical address space for MMIO communication that is not shared with other devices or Tile physical memory space. This physical memory space is divided into the fields shown in Figure 1-1 and defined in Table 1-1.

Note: Not all I/O devices use this partitioning. For example, MiCA does not have Regions or Service Domains. It uses a different type of division, which is described in “Common Accelerator Interface (MiCA)” on page 175.

[Figure 1-1. TILE-Gx Device Address Space: address fields shown left to right as Channel (A), Service Domain (B), Reserved (C), Region (D), and Offset (E)]

Table 1-1. TILE-Gx Physical Memory Space Descriptions

Bits  Bit Name        Required  Size      Description
A     Channel         No        Variable  Used when more than one device shares the same mesh location.
B     Service Domain  No        Variable  Used to index “permissions” table and allow/deny access to specific device services.
C     Reserved        No        Variable  Any “middle” bits of address that are not used.
D     Region          No        Variable  Selects service being accessed (for example register space vs. DMA descriptor post).
E     Offset          Yes       Variable  Address within the “Region” being accessed.

Each device has registers in Region-0 used to control and monitor the device. Devices can also implement additional MMIO address spaces for device communication protocols, such as posting DMA descriptors or returning buffers. System software is responsible for creating and maintaining the page table mappings that provide access to device services.
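
The address partitioning described above can be sketched in C. This is a minimal illustration, not a device definition: the manual states that every field size is variable, so the `mmio_layout` structure, the helper names (`bits`, `fits`, `mmio_pack`, `mmio_region`), and any widths passed in are hypothetical. The field order (Offset lowest, Channel highest) is an assumption read from Figure 1-1, and the `bits` helper follows the word[20:5] bracket notation from the Preface.

```c
#include <assert.h>
#include <stdint.h>

/* Extract bits [hi:lo] (inclusive), matching the manual's word[20:5] notation. */
static uint64_t bits(uint64_t value, unsigned hi, unsigned lo)
{
    assert(hi >= lo && hi < 64);
    unsigned width = hi - lo + 1;
    uint64_t mask = (width == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << width) - 1);
    return (value >> lo) & mask;
}

/* HYPOTHETICAL field widths; each real device chooses its own (Table 1-1). */
struct mmio_layout {
    unsigned offset_bits;    /* E: address within the region */
    unsigned region_bits;    /* D: selects the service being accessed */
    unsigned reserved_bits;  /* C: unused middle bits */
    unsigned svc_dom_bits;   /* B: index into the permissions table */
    unsigned channel_bits;   /* A: selects a device at a shared mesh location */
};

/* True if v fits in an n-bit field (n may be 0 for an absent field). */
static int fits(uint64_t v, unsigned n)
{
    return n >= 64 || (v >> n) == 0;
}

/* Compose a device-relative MMIO address from its fields. */
static uint64_t mmio_pack(const struct mmio_layout *l, uint64_t channel,
                          uint64_t svc_dom, uint64_t region, uint64_t offset)
{
    unsigned region_lo = l->offset_bits;
    unsigned svc_lo = region_lo + l->region_bits + l->reserved_bits;
    unsigned chan_lo = svc_lo + l->svc_dom_bits;

    /* The manual gives each device a 40-bit MMIO physical address space. */
    assert(chan_lo + l->channel_bits <= 40);
    assert(fits(offset, l->offset_bits) && fits(region, l->region_bits));
    assert(fits(svc_dom, l->svc_dom_bits) && fits(channel, l->channel_bits));

    return offset | (region << region_lo) | (svc_dom << svc_lo) |
           (channel << chan_lo);
}

/* Recover the Region field from a packed address. */
static uint64_t mmio_region(const struct mmio_layout *l, uint64_t addr)
{
    assert(l->region_bits > 0);
    return bits(addr, l->offset_bits + l->region_bits - 1, l->offset_bits);
}
```

With an illustrative layout such as `{16, 4, 8, 8, 4}`, `mmio_pack(&layout, 1, 2, 3, 0x20)` places the offset in the low 16 bits and the region immediately above it, and `mmio_region` reads the region back out. Devices such as MiCA that do not use this partitioning would need a different layout description.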

1.1.2 Coherent Shared Memory
I/O devices that provide bulk data transport utilize the high-performance, shared memory system implemented on TILE-Gx processors. All Tile memory system reads and writes initiated from an I/O device are delivered to a home Tile as specified in the physical memory attributes for the associated cacheline. I/O TLBs and/or memory management units (MMUs) are used to translate user or external I/O domain addresses into Tile physical addresses. This provides protection, isolation, and virtualization via a standard virtual memory model.

1.1.3 Device Protection
In addition to the protection provided by the TLB for MMIO loads and stores, devices can provide additional protection mechanisms via the service domain field of the physical address. This allows, for example, portions of a large I/O physical address space to be fragmented, such that services can be allowed or denied to particular user processes without requiring dedicated (smaller) TLB mappings for each allowed service.

1.1.4 Interrupts
Device interrupts are delivered to Tile software via the Tile Interprocess Interrupt (IPI) mechanism. Each Tile has four IPI MPLs, each with 32 interrupt events. I/O interrupts have programmable bindings in their MMIO register space, which specify the target Tile, interrupt number (also referred to as the IPI Minimum Protection Level or IPI MPL), and event number. System software can choose to share Tile interrupt event bits among multiple I/O devices or dedicate the interrupt bits to a single I/O interrupt. Interrupt bits can also be shared between I/O and Tile-to-Tile interrupts. I/O devices implement interrupt status and enable bits to allow interrupt sharing and coalescing.
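
The binding and sharing scheme in Section 1.1.4 can be modeled in software as follows. This is a sketch under stated assumptions, not a register description: the structure names, field widths, and functions (`ipi_binding`, `dev_irq`, `dev_raise`, `dev_has_pending`) are hypothetical. Only the counts (four IPI MPLs per Tile, 32 events each) and the presence of per-device status and enable bits come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { IPI_NUM_MPLS = 4, IPI_EVENTS_PER_MPL = 32 };

/* HYPOTHETICAL model of a programmable interrupt binding. */
struct ipi_binding {
    uint8_t tile_x, tile_y;  /* target Tile on the mesh */
    uint8_t ipi_num;         /* which of the four IPI MPLs */
    uint8_t event;           /* which of the 32 events in that MPL */
};

/* HYPOTHETICAL model of one Tile's pending IPI event bits. */
struct tile_ipi_state {
    uint32_t pending[IPI_NUM_MPLS];
};

/* HYPOTHETICAL model of a device's interrupt status/enable bits. */
struct dev_irq {
    uint32_t status;         /* one bit per device interrupt source */
    uint32_t enable;         /* sources allowed to signal the Tile */
    struct ipi_binding bind;
};

static bool ipi_binding_valid(const struct ipi_binding *b)
{
    return b->ipi_num < IPI_NUM_MPLS && b->event < IPI_EVENTS_PER_MPL;
}

/* Record a device interrupt source; signal the Tile only if it is enabled. */
static void dev_raise(struct dev_irq *d, unsigned source,
                      struct tile_ipi_state *t)
{
    assert(source < 32 && ipi_binding_valid(&d->bind));
    d->status |= UINT32_C(1) << source;
    if (d->status & d->enable)
        t->pending[d->bind.ipi_num] |= UINT32_C(1) << d->bind.event;
}

/* When an event bit is shared, a handler polls each device to find the sender. */
static bool dev_has_pending(const struct dev_irq *d)
{
    return (d->status & d->enable) != 0;
}
```

In this model, two devices bound to the same Tile, IPI number, and event share one pending bit. The handler calls `dev_has_pending` on each device to learn which one raised the interrupt, which is the role the text assigns to the per-device status and enable bits.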

1.1.5 Device Discovery
To facilitate a common device initialization framework, the TILE-Gx processors contain registers and I/O structures that allow non-device-specific software to “discover” the connected I/O devices for a given chip. After discovery, device-specific software drivers can be launched as needed.

All TILE-Gx processors contain an Rshim. The Rshim contains chip-wide services including boot controls, diagnostics, clock controls, reset controls, and global device information. The Rshim’s RSH_FABRIC_DIM, RSH_FABRIC_CONN, and RSH_IPI_LOC registers provide Tile fabric sizing, I/O connectivity, and IPI information to allow software to enumerate the various devices. The common registers located on each device contain the device identifier used to launch device-specific driver software.

In order for Level-1 boot software to perform discovery, it must first find the Rshim. This is done by reading the RSH_COORD SPR located in each Tile. Thus the basic device discovery flow is:
1. Read the RSH_COORD SPR to determine the Rshim location on the mesh.
2. Install an MMIO TLB entry for the Rshim.
3. Read the RSH_FABRIC_CONN vectors from Rshim to determine I/O device locations.
4. Install MMIO TLB entries for each I/O device.
5. Read the RSH_DEV_INFO register from each device to determine what the device type is, and launch any device-specific software.

1.1.6 Common Registers
While each device has unique performance and API requirements, a common device architecture allows a modular software driver model and device initialization process. The first 256 bytes of MMIO space contain the “common” registers that all I/O devices implement (see footnote 1). The common registers are used for device discovery as well as basic physical memory initialization and MMIO page sizing.

Table 1-2.
Common Registers Register Address Description DEV_INFO 0x0000 This provides general information about the device attached to this port and channel. DEV_CTL 0x0008 This provides general device control. MMIO_INFO 0x0010 This provides information about how the physical address is interpreted by the I/O device. MEM_INFO 0x0018 This provides information about memory setup required for this device. SCRATCHPAD 0x0020 This is for general software use, and is not used by the I/O shim hardware for any purpose. 1. The “common registers” are located from 0x0000-0x0058. Tile Processor and I/O Device Guide for the TILE-Gx Family of Processors Tilera Confidential — Subject to Change Without Notice 3 Chapter 1 I/O Device Introduction Table 12. Common Registers (continued) Register Address Description SEMAPHORE0 and SEMAPHORE1 0x0028 and 0x0030 This is for general software use, and is not used by the I/O shim hardware for any purpose. CLOCK_COUNT 0x0038 This is for general software use, and is not used by the I/O shim hardware for any purpose. HFH_INIT_CTL 0x0050 Initialization control for the hash-for-home tables. HFH_INIT_DAT 0x0058 Read/Write data for hash-for-home tables. Each of the major register sets (for example: the GPIO, UART, and MICA Crypto registers) for a specific device includes the common registers in the register set. The SCRATCHPAD register, for example, is a common register included in each of the register sets. The register set name prepends the register name as follows: • GPIO Register: GPIO_SCRATCHPAD register • UART Register: UART_SCRATCHPAD register • MICA_CRYPTO Register: MICA_CRYPTO_SCRATCHPAD register Registers beyond 0x100 contain the device specific registers. Register definitions can be found as part of the MDE build and are located in the HTML directory. 
The directory structure is as follows:

• Memory Controller
• GPIO
• Rshim
• I2C Slave
• I2C Master
• SROM
• UART

Compression
• MiCA Compression Global
• MiCA Compression Inflate Engine
• MiCA Compression Deflate Engine
• MiCA Compression User Context
• MiCA Compression System Context

Crypto
• MiCA Crypto Global
• MiCA Crypto Engine
• MiCA Crypto User Context
• MiCA Crypto System Context

mPIPE / MACs
• mPIPE
• XAUI (Interface/MAC)
• GbE (Interface/MAC)
• Interlaken (Interface/MAC)
• mPIPE SERDES Control

TRIO / PCIe
• TRIO
• PCIe Interface (SERDES control, endpoint vs. root, etc.)
• PCIe Endpoint
• PCIe Root Complex
• PCIe SERDES Control

USB
• USB Host
• USB Endpoint
• USB Host MAC
• USB Endpoint MAC

CHAPTER 2 TILE PROCESSOR

2.1 System Architecture Overview

The Tile Processor™ is a new class of multicore processing engine that delivers unprecedented levels of performance, flexibility, and power efficiency. The Tile Processor is fully programmable using standard ANSI C and C++, which makes it easy to port existing applications to the Tile Processor environment. The device implements Tilera’s iMesh™ Multicore technology, which enables applications to be scaled across multiple cores or tiles. Combining multiple C and C++ programmable processor tiles with iMesh multicore technology enables the Tile Processor to achieve the performance of an ASIC in a software-programmable solution, which reduces development costs and shortens time-to-market.

Each tile is a powerful full-featured processor that can independently run an entire operating system.
Each tile implements a 64-bit integer processor engine, a memory management unit (MMU) including TLBs (Translation Lookaside Buffers), a register file, a program counter (PC), and an L1/L2 cache subsystem.

The tiles in the Tile Processor are connected to each other, to the external memory, and to the I/O by the Tilera iMesh multicore technology. Attaching the memory controllers and I/O to the iMesh allows any tile to access any memory and also allows any tile to service any I/O device. The iMesh also supplies very low-latency messaging and scalar transfers to user-level applications, enabling very efficient multi-programming. Tilera’s iMesh multicore technology enables the Tile Processor to provide performance scalability and high-bandwidth, low-latency communication between tiles.

The Tile Processor implements a powerful protection mechanism for processing resources to allow fine-grained control and management by operating systems and/or virtual machine implementations. The Tile Processor implements four protection levels, which can be used simultaneously to supply user-level, operating-system-level, virtual-machine-level, and debug-level programming. The protection system divides processing resources into functionally related groups, and protects each of these functions with an individual access control. This allows control over these processing resources to be distributed to different levels of the support software stack.

The Tile Processor supplies memory protection by implementing a virtual memory system with support for multiple operating systems running on multiple tiles. The virtual memory system implements 64-bit virtual addresses, and up to 64 bits of physical address space. The Tile Processor can support a physical address mode for applications that do not require virtual memory. The memory system supports a number of coherent shared memory options with different performance characteristics to optimize performance for different kinds of workloads.
The processor engine, the primary computational resource, is an asymmetric very long instruction word (VLIW) processor. Each instruction bundle is 64 bits wide and can encode either two or three instructions. Some instructions can be encoded in either two-wide or three-wide bundles, and some can be encoded in two-wide bundles only. The most common instructions and those with short immediates can be encoded in a three-instruction format.

Nearly all instructions have a single-cycle result latency; the exceptions are complex SIMD instructions, multiplication, and most floating-point and memory instructions. Nearly all of the multi-cycle instructions are fully pipelined, allowing additional instructions to be issued in the following cycles. The complex SIMD instructions, multiplication, and floating-point instructions have a two-cycle result latency. A load that cannot be satisfied immediately from the L1 data cache does not stall the processor until another operation attempts to read the register written by the load. In this way, additional instructions can be issued during cache misses. Additional information about the processor engine and its instruction set is provided in the Instruction Set Architecture for TILE-Gx (UG401).

2.2 Memory Architecture

The Tile Processor architecture defines a flat, globally shared 64-bit physical address space and a 64-bit virtual address space (note that TILE-Gx processors implement a 40-bit subset of the physical address and a 42-bit subset of the virtual address). Memory is byte-addressable and can be addressed in 1-, 2-, 4-, or 8-byte units, depending on alignment. Memory transactions to and from a tile occur via the iMesh.
The globally shared physical address space provides the mechanism by which software running on different tiles, and I/O devices, share instructions and data. Memory is stored in off-chip DDR3 DRAM.

Page tables are used to translate virtual addresses to physical addresses (page sizes range from 4 kB to 64 GB). The translation process includes a verification of protected regions of memory, and also a designation of each page of physical addresses as coherent, non-coherent, uncacheable, or memory-mapped I/O (MMIO). For coherent and non-coherent pages, values from recently accessed memory addresses are stored in caches located in each tile. Uncacheable and MMIO addresses are never put into a tile cache.

The Address Space Identifier (ASID) is used for managing multiple active address spaces. Recently used page table entries are cached in TLBs (Translation Lookaside Buffers) in both tiles and I/O devices.

Hardware provides a cache-coherent view of memory to applications. That is, a read by a tile or I/O device to a given physical address returns the value of the most recent write to that address, even if it is in a tile’s cache. Instruction memory that is written by software (self-modifying code) is not kept coherent by hardware. Rather, special software sequences using the icoh instruction must be used to enforce coherence between data and instruction memory.

Atomic operations include FetchAdd, CmpXchg, FetchAddGez, Xchg, FetchOr, and FetchAnd. Memory ordering is relaxed, and a memory fence instruction provides sequential ordering. See Section 2.4, “Memory Consistency Model”.

Virtual Address Space

The virtual address is architecturally 64 bits, but is implemented as 42 bits in the TILE-Gx processor. A legal virtual address is a sign-extended 42-bit value; that is, bits [63:41] of the VA are all 0s or all 1s. Virtual addresses that are not sign-extended are illegal. The implication is that there are two legal VA regions, lower and upper, and an illegal region in the middle, as shown in Table 2-1.
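The legality rule can be expressed as a short check. The following is a minimal sketch in portable C, assuming only the 42-bit implemented width stated in the text; the function and macro names are illustrative and do not come from any Tilera header.

```c
#include <stdbool.h>
#include <stdint.h>

#define VA_IMPLEMENTED_BITS 42  /* implemented VA width on TILE-Gx, per the text */

/* A VA is legal when bits [63:41] are all 0s (lower region) or all 1s
 * (upper region), i.e. the 64-bit value is a sign-extended 42-bit value. */
bool va_is_legal(uint64_t va) {
    const unsigned shift = VA_IMPLEMENTED_BITS - 1;               /* 41 */
    const uint64_t upper = va >> shift;                           /* bits [63:41] */
    const uint64_t all_ones = (UINT64_C(1) << (64 - shift)) - 1;  /* 23 one bits */
    return upper == 0 || upper == all_ones;
}
```

The largest lower-region address is 2^41 - 1 and the smallest upper-region address is 2^64 - 2^41; every address strictly between them fails the check.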
It is illegal to perform a memory operation (for example, a load or store) on an illegal VA, to execute instructions from an illegal VA, or to take a branch from the lower to the upper VA region (or vice versa). An attempt to do so results in an exception.

Table 2-1. Virtual Address Space

Address                      Region
2^64-1 ... 2^64-2^41         Upper VA Region
2^64-2^41-1 ... 2^41         Illegal VA Region
2^41-1 ... 0                 Lower VA Region

2.3 Memory Addressing

2.3.1 TLB Management

Each TLB entry can be directly read and written by system software. The architecture does not prescribe the number of TLB entries. Rather, it sets a maximum and leaves the number of implemented TLB entries as an implementation parameter. The read-only special purpose registers NUMBER_DTLB and NUMBER_ITLB denote how many entries each of the respective TLBs contains. TILE-Gx implements 16 ITLB entries and 32 DTLB entries.

The WIRED_DTLB and WIRED_ITLB SPRs specify the number of TLB entries that are managed completely by software and will not be selected by hardware for replacement.

The REPLACEMENT_ITLB and REPLACEMENT_DTLB SPRs are maintained by hardware to provide a recommended replacement TLB entry. The algorithm that hardware uses to generate the replacement entry numbers is implementation-specific; TILE-Gx uses a random replacement algorithm. On processor reset, these SPRs are set to the number of implemented TLB entries minus one. The recommended TLB entry will not be a wired entry.

A given TLB entry is read or written by first indexing the desired entry and then using the proper TLB_CURRENT_x SPR to read or write the entry selected by the index register. To allow register indexing into the TLB, the DTLB_INDEX and ITLB_INDEX registers are used.
Three SPRs (TLB_CURRENT_VA, TLB_CURRENT_PA, and TLB_CURRENT_ATTR) access the three words of the TLB entry that is indexed by DTLB_INDEX or ITLB_INDEX. The TLB_CURRENT_x registers do not contain state of their own; they provide access to the indexed TLB state. Each type of TLB has its own set of these registers: DTLB_CURRENT_x and ITLB_CURRENT_x.

To read a TLB entry, software writes the index into the DTLB_INDEX or ITLB_INDEX SPR with the top bit (MSB) set. Setting the MSB causes the specified entry to be read and stored in the TLB_CURRENT_x SPRs. Software should issue a DRAIN instruction between setting the index register and reading data from the TLB_CURRENT_x registers.

To write a TLB entry, software writes the index into the DTLB_INDEX or ITLB_INDEX SPR and then writes the TLB_CURRENT_x SPRs, with TLB_CURRENT_ATTR last. The write of TLB_CURRENT_ATTR is the trigger that writes the entire TLB entry specified by the TLB_CURRENT_x registers into the actual TLB.

Data TLB Number of Entries Register (NUMBER_DTLB)

This register specifies how many data TLB entries there are.

Speed: Fast
Minimum Protection Level: DTLB_MISS

Figure 2-1: NUMBER_DTLB Register Diagram (fields: Reserved, NUM)

Table 2-2. NUMBER_DTLB Register Bit Descriptions

Bits    Name      Reset  Description
63:13   Reserved         Reserved
12:0    NUM       0      Number. TILE-Gx implements the bitfield 5:0; writes to bits 12:6 are ignored, and these bits are read as 0.

Instruction TLB Number of Entries Register (NUMBER_ITLB)

This register specifies how many instruction TLB entries there are.

Speed: Fast
Minimum Protection Level: ITLB_MISS

Figure 2-2: NUMBER_ITLB Register Diagram (fields: Reserved, NUM)

Table 2-3. NUMBER_ITLB Register Bit Descriptions

Bits    Name      Reset  Description
63:13   Reserved         Reserved
12:0    NUM       0      Number. TILE-Gx implements the bitfield 4:0; writes to bits 12:5 are ignored, and these bits are read as 0.

Instruction TLB Replacement Index Register (REPLACEMENT_ITLB)

This register specifies which instruction TLB entry should be replaced by the random replacement algorithm.

Speed: Fast
Minimum Protection Level: ITLB_MISS

Figure 2-3: REPLACEMENT_ITLB Register Diagram (fields: Reserved, INDEX)

Table 2-4. REPLACEMENT_ITLB Register Bit Descriptions

Bits    Name      Reset  Description
63:12   Reserved         Reserved
11:0    INDEX     0      Index. TILE-Gx implements the bitfield 3:0; writes to bits 11:4 are ignored, and these bits are read as 0.

Data TLB Replacement Index Register (REPLACEMENT_DTLB)

This register specifies which data TLB entry should be replaced by the random replacement algorithm.

Speed: Fast
Minimum Protection Level: DTLB_MISS

Figure 2-4: REPLACEMENT_DTLB Register Diagram (fields: Reserved, INDEX)

Table 2-5. REPLACEMENT_DTLB Register Bit Descriptions

Bits    Name      Reset  Description
63:12   Reserved         Reserved
11:0    INDEX     0      Index. TILE-Gx implements the bitfield 4:0; writes to bits 11:5 are ignored, and these bits are read as 0.

Instruction TLB Entry VA Register (ITLB_CURRENT_VA)

This register is used to read and write the virtual address of the main processor instruction TLB entry.

Speed: Fast
Minimum Protection Level: ITLB_MISS

Figure 2-5: ITLB_CURRENT_VA Register Diagram (fields: VPN, Reserved)

Table 2-6. ITLB_CURRENT_VA Register Bit Descriptions

Bits    Name      Reset  Description
63:12   VPN       0      Virtual Page Number.
11:0    Reserved         Reserved

Instruction TLB Entry PA Register (ITLB_CURRENT_PA)

This register is used to read and write the physical address of the main processor instruction TLB entry.

Speed: Fast
Minimum Protection Level: ITLB_MISS

Figure 2-6: ITLB_CURRENT_PA Register Diagram (fields: Reserved, PFN, Reserved)

Table 2-7. ITLB_CURRENT_PA Register Bit Descriptions

Bits    Name      Reset  Description
63:40   Reserved         Reserved
39:12   PFN       0      Physical Frame Number.
11:0    Reserved         Reserved

Instruction TLB Entry Attribute Register (ITLB_CURRENT_ATTR)

This register is used to read and write the processor instruction TLB entry attribute. Writing this register triggers the write of the ITLB.

Speed: Fast
Minimum Protection Level: ITLB_MISS

Figure 2-7: ITLB_CURRENT_ATTR Register Diagram (fields, from bit 63 down: Reserved, LOCATION_X_OR_PAGE_MASK, LOCATION_Y_OR_PAGE_OFFSET, CACHE_PREFETCH, Reserved, PIN, ADAPTIVE_ALLOCATION, NO_L1D_ALLOCATION, CACHE_HOME_MAPPING, MEMORY_ATTRIBUTE, ASID, G, PS, MPL, W, V)

Table 2-8. ITLB_CURRENT_ATTR Register Bit Descriptions

Bits    Name                        Reset  Description
63:48   Reserved                           Reserved
47:37   LOCATION_X_OR_PAGE_MASK     0      Location Override Target X field for MMIO pages and non-hash-for-home Coherent or NonCoherent pages; page mask field for hash-for-home Coherent or NonCoherent pages.
36:26   LOCATION_Y_OR_PAGE_OFFSET   0      Location Override Target Y field for MMIO pages and non-hash-for-home Coherent or NonCoherent pages; page offset field for hash-for-home Coherent or NonCoherent pages.
25      CACHE_PREFETCH              0      Cache Page Prefetch Attribute. Hardware prefetcher may generate prefetches. TILE-Gx does not implement the page-prefetch hardware; the attribute is reserved for a future implementation.
24      Reserved                           Reserved
23      PIN                         0      PIN. L2 and L3 cache allocation follows the TILE-Gx cache pinning rule. The attribute is used in Coherent and NonCoherent pages, and is ignored in Uncacheable or MMIO pages.
22      ADAPTIVE_ALLOCATION         0      L2 cache allocation follows the TILE-Gx Adaptive Allocation rule. The attribute is used in Coherent and NonCoherent pages, and is ignored in Uncacheable or MMIO pages.
21      NO_L1D_ALLOCATION           0      L1D cache will not be filled (ignored by instruction stream requests). The attribute is used in Coherent and NonCoherent pages, and is ignored in Uncacheable or MMIO pages.
20:19   CACHE_HOME_MAPPING          0      Describes how the home cache for each cacheline is determined.
        Value  Name  Meaning
        0      HASH  The home cache is computed from the cacheline's physical address using the default hash-for-home scheme.
        3      TILE  For all lines, the home is the tile whose X and Y coordinates are specified in the LOCATION_X_OR_PAGE_MASK and LOCATION_Y_OR_PAGE_OFFSET fields.
18:17   MEMORY_ATTRIBUTE            0      Describes how accesses to memory via this translation are cached, if at all.
        Value  Name         Meaning
        0      COHERENT     Data is cached locally; loads and stores target the home cache upon a miss in the local cache, and the home cache invalidates the local cache if the data is changed in the L3.
        1      NONCOHERENT  Data is cached locally; loads and stores target the home cache upon a miss in the local cache, but the home cache does not invalidate the local cache if the data is changed in the L3.
        2      UNCACHEABLE  The data is never cached locally; loads and stores always target the memory controller.
        3      MMIO         The data is never cached locally; loads and stores target an I/O device whose address is given by the LOCATION_X_OR_PAGE_MASK and LOCATION_Y_OR_PAGE_OFFSET fields.
16:9    ASID                        0      Address Space Identifier
8       G                           0      Global
7:4     PS                          0      Page Size
        Value  Name       Meaning
        0      4K_PAGE    Page size: 4K-byte
        1      16K_PAGE   Page size: 16K-byte
        2      64K_PAGE   Page size: 64K-byte
        3      256K_PAGE  Page size: 256K-byte
        4      1M_PAGE    Page size: 1M-byte
        5      4M_PAGE    Page size: 4M-byte
        6      16M_PAGE   Page size: 16M-byte
        7      64M_PAGE   Page size: 64M-byte
        8      256M_PAGE  Page size: 256M-byte
        9      1G_PAGE    Page size: 1G-byte
        10     4G_PAGE    Page size: 4G-byte
        11     16G_PAGE   Page size: 16G-byte
        12     64G_PAGE   Page size: 64G-byte
3:2     MPL                         0      Minimum Protection Level
1       W                           0      Writable
0       V                           0      Valid

Data TLB Entry VA Register (DTLB_CURRENT_VA)

This register is used to read and write the virtual address of the main processor data TLB entry.

Speed: Fast
Minimum Protection Level: DTLB_MISS

Figure 2-8: DTLB_CURRENT_VA Register Diagram (fields: VPN, Reserved)

Table 2-9. DTLB_CURRENT_VA Register Bit Descriptions

Bits    Name      Reset  Description
63:12   VPN       0      Virtual Page Number.
11:0    Reserved         Reserved

Data TLB Entry PA Register (DTLB_CURRENT_PA)

This register is used to read and write the physical address of the main processor data TLB entry.

Speed: Fast
Minimum Protection Level: DTLB_MISS

Figure 2-9: DTLB_CURRENT_PA Register Diagram (fields: Reserved, PFN, Reserved)

Table 2-10. DTLB_CURRENT_PA Register Bit Descriptions

Bits    Name      Reset  Description
63:40   Reserved         Reserved
39:12   PFN       0      Physical Frame Number.
11:0    Reserved         Reserved

Data TLB Entry Attribute Register (DTLB_CURRENT_ATTR)

This register is used to read and write the processor data TLB entry attribute. Writing this register triggers the write of the DTLB.

Speed: Fast
Minimum Protection Level: DTLB_MISS

Figure 2-10: DTLB_CURRENT_ATTR Register Diagram (fields, from bit 63 down: Reserved, LOCATION_X_OR_PAGE_MASK, LOCATION_Y_OR_PAGE_OFFSET, CACHE_PREFETCH, Reserved, PIN, ADAPTIVE_ALLOCATION, NO_L1D_ALLOCATION, CACHE_HOME_MAPPING, MEMORY_ATTRIBUTE, ASID, G, PS, MPL, W, V)

Table 2-11. DTLB_CURRENT_ATTR Register Bit Descriptions

Bits    Name                        Reset  Description
63:48   Reserved                           Reserved
47:37   LOCATION_X_OR_PAGE_MASK     0      Location Override Target X field for MMIO pages and non-hash-for-home Coherent or NonCoherent pages; page mask field for hash-for-home Coherent or NonCoherent pages.
36:26   LOCATION_Y_OR_PAGE_OFFSET   0      Location Override Target Y field for MMIO pages and non-hash-for-home Coherent or NonCoherent pages; page offset field for hash-for-home Coherent or NonCoherent pages.
25      CACHE_PREFETCH              0      Cache Page Prefetch Attribute. Hardware prefetcher may generate prefetches. TILE-Gx does not implement the page-prefetch hardware; the attribute is reserved for a future implementation.
24      Reserved                           Reserved
23      PIN                         0      PIN
22      ADAPTIVE_ALLOCATION                Adaptive Allocation
21      NO_L1D_ALLOCATION                  No L1D Allocation
20:19   CACHE_HOME_MAPPING                 Cache Home Mapping
18:17   MEMORY_ATTRIBUTE                   Memory Attribute
16:9    ASID                               Address Space Identifier
8       G                                  Global
7:4     PS                                 Page Size
        Value  Name       Meaning
        0      4K_PAGE    Page size: 4K-byte
        1      16K_PAGE   Page size: 16K-byte
        2      64K_PAGE   Page size: 64K-byte
        3      256K_PAGE  Page size: 256K-byte
        4      1M_PAGE    Page size: 1M-byte
        5      4M_PAGE    Page size: 4M-byte
        6      16M_PAGE   Page size: 16M-byte
        7      64M_PAGE   Page size: 64M-byte
        8      256M_PAGE  Page size: 256M-byte
        9      1G_PAGE    Page size: 1G-byte
        10     4G_PAGE    Page size: 4G-byte
        11     16G_PAGE   Page size: 16G-byte
        12     64G_PAGE   Page size: 64G-byte
3:2     MPL                                Minimum Protection Level
1       W                           0      Writable
0       V                           0      Valid

Data TLB Index Register (DTLB_INDEX)

This register is used to specify which data TLB entry is read and written by the DTLB_CURRENT_x registers. Writing a 1 into the top bit of this register forces a read of the indexed TLB entry into the DTLB_CURRENT_x registers.

Several aspects of TLB read/write behavior bear mentioning:

• Writing TLB_CURRENT_ATTR is the trigger for writing the entire TLB entry specified by the TLB_CURRENT_x registers back to the actual TLB.
• After setting a TLB index register, a DRAIN instruction must be issued before the referenced TLB entry is readable from the TLB_CURRENT_x registers.

Speed: Fast
Minimum Protection Level: DTLB_MISS

Figure 2-11: DTLB_INDEX Register Diagram (fields: R, L, Reserved, INDEX)

Table 2-12. DTLB_INDEX Register Bit Descriptions

Bits    Name      Reset  Description
63      R         0      Read
62      L                Load from REPLACEMENT_DTLB. Reads as zero.
61:12   Reserved         Reserved
11:0    INDEX     0      Index. TILE-Gx implements the bitfield 4:0; writes to bits 11:5 are ignored, and these bits are read as 0.

Instruction TLB Index Register (ITLB_INDEX)

This register is used to specify which instruction TLB entry is read and written by the ITLB_CURRENT_x registers. Writing a 1 into the top bit of this register forces a read of the indexed TLB entry into the ITLB_CURRENT_x registers.

Several aspects of TLB read/write behavior bear mentioning:

• Writing TLB_CURRENT_ATTR is the trigger for writing the entire TLB entry specified by the TLB_CURRENT_x registers back to the actual TLB.
• After setting a TLB index register, a DRAIN instruction must be issued before the referenced TLB entry is readable from the TLB_CURRENT_x registers.

Speed: Fast
Minimum Protection Level: ITLB_MISS

Figure 2-12: ITLB_INDEX Register Diagram (fields: R, L, Reserved, INDEX)

Table 2-13. ITLB_INDEX Register Bit Descriptions

Bits    Name      Reset  Description
63      R         0      Read
62      L         0      Load from REPLACEMENT_ITLB. Reads as zero.
61:12   Reserved         Reserved
11:0    INDEX     0      Index. TILE-Gx implements the bitfield 3:0; writes to bits 11:4 are ignored, and these bits are read as 0.

2.3.1.1 TLB Miss Handling

When any access occurs to an address that is not in the TLB, a TLB Miss occurs. When a write access occurs to an address that is in the TLB but is not marked Writable, a TLB Access Violation occurs. In either case, the faulting address is loaded into the TLB’s bad address SPR: DTLB_BAD_ADDR or EX_CONTEXT_x. A TLB Miss or TLB Access Violation is then signaled to notify software of the event. The DTLB_BAD_ADDR_REASON SPR indicates the reason for a DTLB miss or access violation.

Software is responsible for taking whatever action is required: filling in the missing TLB entry, paging in data from disk, terminating a process that has tried to write to a read-only address, and so forth.
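When a handler fills in a missing entry, it has to assemble the attribute word and the index-register value from the bit layouts given in the register tables above. The following is a minimal sketch of that packing in portable C. It only builds the 64-bit values; the SPR writes and the DRAIN instruction need Tile-specific instructions and are left out. The struct, enum, and function names are hypothetical, and only the bit positions and value encodings are taken from the tables.

```c
#include <assert.h>
#include <stdint.h>

/* Place `value` in a field of `width` bits starting at bit `shift`. */
static uint64_t field(uint64_t value, unsigned shift, unsigned width) {
    const uint64_t mask = (UINT64_C(1) << width) - 1;  /* width < 64 for every field here */
    assert((value & ~mask) == 0);                      /* value must fit in the field */
    return value << shift;
}

/* MEMORY_ATTRIBUTE encodings (bits 18:17). */
enum mem_attr { MEM_COHERENT = 0, MEM_NONCOHERENT = 1, MEM_UNCACHEABLE = 2, MEM_MMIO = 3 };

/* Hypothetical software view of the commonly used attribute fields. */
struct tlb_attr {
    unsigned valid, writable, mpl, page_size, global, asid;
    enum mem_attr memory_attribute;
    unsigned cache_home_mapping;      /* 0 = HASH, 3 = TILE */
    unsigned location_y, location_x;  /* or page offset / page mask */
};

/* Build the xTLB_CURRENT_ATTR word from the individual fields. */
uint64_t tlb_attr_pack(const struct tlb_attr *a) {
    return field(a->valid, 0, 1)
         | field(a->writable, 1, 1)
         | field(a->mpl, 2, 2)
         | field(a->page_size, 4, 4)
         | field(a->global, 8, 1)
         | field(a->asid, 9, 8)
         | field(a->memory_attribute, 17, 2)
         | field(a->cache_home_mapping, 19, 2)
         | field(a->location_y, 26, 11)
         | field(a->location_x, 37, 11);
}

/* PS encoding: 0 is 4 KB, and each step multiplies the size by four. */
uint64_t tlb_page_size_bytes(unsigned ps) {
    assert(ps <= 12);  /* 12 = 64G_PAGE, the largest listed value */
    return UINT64_C(4096) << (2 * ps);
}

/* Value for xTLB_INDEX that selects `index` and sets R (bit 63) to force a read. */
uint64_t tlb_index_read_request(unsigned index) {
    return (UINT64_C(1) << 63) | field(index, 0, 12);
}
```

For example, an MMIO mapping aimed at a device at mesh coordinates (3, 5) would set memory_attribute to MEM_MMIO and place 3 and 5 in location_x and location_y.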
System software is responsible for ensuring that there are never multiple DTLB entries that match on a translation. Multiple matches cause a memory error to be signaled.

Special precautions must be taken in the following scenarios to ensure that multiple matching DTLB entries are not installed into the TLB:

• Software inserts a translation for page P with the global bit set, and a translation for page P already exists for a particular ASID.
• Software inserts an overlapping page due to differing page sizes.

To avoid multiple matching DTLB entries, system software can use the data TLB probe instruction (DTLBPR). This instruction takes a register containing a virtual address as its source operand and checks the DTLB for any entries that match this virtual address. The ASID field is ignored when the lookup is performed. The data CPL is used. The match result is written into the DTLB_MATCH_0 SPR, which contains a 1 in each bit position corresponding to a DTLB entry that matched the virtual address. The DTLB probe instruction is also useful when upgrading a “read only” page to a “writable” page.

Read and write access to the TLBs must be protected to prevent invalid TLB entries from being added. To accomplish this, the MPLs for the respective TLB Miss interrupts determine what protection level is required to read or write the TLB entries for a particular TLB. If a TLB resource is accessed without sufficient privileges, a General Protection Violation interrupt is signaled. The General Protection Violation occurs at the minimum protection level of the faulting resource. When a GP Violation is signaled, the GPV_REASON SPR is filled in with the access violation cause.

General Protection Violation Reason Register (GPV_REASON)

Contains the reason that a GPV has occurred.

Speed: Slow
Minimum Protection Level: GPV

Figure 2-13: GPV_REASON Register Diagram (fields: Reserved, IRET_ERROR, MT_ERROR, MF_ERROR, Reserved, SPR_INDEX)

Table 2-14. GPV_REASON Register Bit Descriptions

Bits    Name        Reset  Description
63:32   Reserved           Reserved
31      IRET_ERROR  0      Set if there was an IRET violation.
30      MT_ERROR    0      Set if there was a move-to-SPR access violation.
29      MF_ERROR    0      Set if there was a move-from-SPR access violation.
28:14   Reserved           Reserved
13:0    SPR_INDEX   0      The index of the SPR involved in the access violation.

2.4 Memory Consistency Model

2.4.1 Overview

The Tile Processor architecture’s memory consistency model specifies the order in which memory operations from a processor become visible to other processors in the coherence domain. The memory consistency model defines two main properties, P1 and P2: instruction reordering rules and store atomicity. The Tile Processor architecture defines a relaxed memory consistency model in which:

P1: Instruction Reordering

Non-overlapping memory accesses from a given processor that reference shared pages can be reordered and can become visible to other processors sharing that page in an order different from the original program order, with the following restrictions:

• Data dependencies through memory accesses from a single processor are enforced (RAW, WAW, and WAR).
• Data dependencies through registers or memory determine local visibility order.
• Local ordering established by memory data dependencies or register dependencies does not determine global visibility order. Data writes (including atomic operations and flushes) must observe control dependencies.
P2: Store Atomicity

Stores performed by a processor appear to become visible simultaneously to all remote processors, but can become visible to the issuing processor before becoming globally visible (for example, by bypassing to a subsequent load through a write buffer). Atomic operations are atomic to all processors: bypassing to or from atomic operations is not allowed.

The Tile Processor architecture provides the memory fence (MF) instruction to establish ordering among otherwise unordered instructions when such ordering is needed for correctness. Data memory operations in the program prior to the memory fence instruction are made globally visible before any operation after the memory fence.

The Tile Processor architecture provides the FetchAdd, CmpXchg, FetchAddGez, Xchg, FetchOr, and FetchAnd operations to read and write a memory location atomically.

The following code sequences illustrate the properties of the tile memory consistency model. In the examples that follow, memory addresses are denoted by x and y, are word aligned, and are assumed to contain the value 0 initially. All loads and stores are word-sized. The notation A → B indicates that operation A becomes visible to all processors in the coherence domain before operation B becomes visible.

Listing 2-1 through Listing 2-5 below illustrate property P1 (instruction reordering). Listing 2-6 through Listing 2-8 illustrate property P2 (store atomicity and write bypassing).

Listing 2-1. Property P1: Instruction Reordering. Stores can reorder with stores to different locations and loads can reorder with loads to different locations.

    Tile 0            | Tile 1
    sw [x] = 1        | lw r1 = [y]
    sw [y] = 1        | lw r2 = [x]

All outcomes for r1 and r2 are possible. The stores can be made visible in any order.
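The claim that every outcome is possible can be checked by brute force. The sketch below is a simplified model written for illustration, not a description of the hardware: it assumes a single global visibility order for the four operations (two stores on Tile 0, two loads on Tile 1) and enumerates every such order. With `fenced` set, it additionally requires each tile's two operations to keep program order, which models placing an MF between them. All names are hypothetical.

```c
#include <stdbool.h>

enum op { ST_X, ST_Y, LD_Y, LD_X, NUM_OPS };

/* Perform the four operations in the given global order and return the
 * outcome encoded as (r1 << 1) | r2, where r1 = [y] and r2 = [x]. */
static unsigned run(const enum op order[NUM_OPS]) {
    int x = 0, y = 0, r1 = 0, r2 = 0;
    for (int i = 0; i < NUM_OPS; i++) {
        switch (order[i]) {
        case ST_X: x = 1;  break;
        case ST_Y: y = 1;  break;
        case LD_Y: r1 = y; break;
        case LD_X: r2 = x; break;
        default:           break;
        }
    }
    return ((unsigned)r1 << 1) | (unsigned)r2;
}

/* Position of operation `which` within `order`. */
static int pos(const enum op order[NUM_OPS], enum op which) {
    for (int i = 0; i < NUM_OPS; i++)
        if (order[i] == which) return i;
    return -1;
}

/* Returns a 4-bit set: bit k is set if outcome k = (r1 << 1) | r2 is reachable. */
unsigned litmus_outcomes(bool fenced) {
    unsigned reachable = 0;
    for (int code = 0; code < 256; code++) {      /* 4^4 candidate sequences */
        enum op order[NUM_OPS];
        unsigned seen = 0;
        for (int i = 0, c = code; i < NUM_OPS; i++, c /= 4) {
            order[i] = (enum op)(c % 4);
            seen |= 1u << order[i];
        }
        if (seen != 0xFu) continue;               /* not a permutation of the four ops */
        if (fenced && (pos(order, ST_X) > pos(order, ST_Y) ||
                       pos(order, LD_Y) > pos(order, LD_X)))
            continue;                             /* order would violate a fence */
        reachable |= 1u << run(order);
    }
    return reachable;
}
```

Under this model, litmus_outcomes(false) reports all four outcomes as reachable, while litmus_outcomes(true) excludes exactly r1 == 1 with r2 == 0.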
Implementations are free to reorder data memory operations to different locations. Program order does not imply visibility order.

Listing 2-2. Property P1: Instruction Reordering. Ordering is enforced through the memory fence instruction.

    Tile 0               | Tile 1
    sw [x] = 1   // M1   | lw r1 = [y]  // M4
    MF           // M2   | MF           // M5
    sw [y] = 1   // M3   | lw r2 = [x]  // M6

The only illegal outcome is r1 == 1 and r2 == 0. Notice that this example is the same as in Listing 2-1, except that here an MF instruction is inserted between the pair of stores on Tile 0 and also between the pair of loads on Tile 1. The use of the MF instruction ensures that M1 → M3 and M4 → M6. Therefore, if M3 is visible to M4, then M1 is visible to M6.

Listing 2-3. Property P1: Instruction Reordering. Loads can reorder with stores to different locations.

    Tile 0               | Tile 1
    sw [x] = 1   // M1   | sw [y] = 1   // M3
    lw r1 = [y]  // M2   | lw r2 = [x]  // M4

This example is similar to Listing 2-1, in that the loads and stores on each tile have no dependence and can be freely reordered. All outcomes are legal.

Listing 2-4. Property P1: Instruction Reordering. Preventing loads from passing stores to different locations.

    Tile 0               | Tile 1
    sw [x] = 1   // M1   | sw [y] = 1   // M3
    MF                   | MF
    lw r1 = [y]  // M2   | lw r2 = [x]  // M4

The only illegal outcome is r1 == r2 == 0. This example is similar to the one shown in Listing 2-3, except that there are now MF instructions between the memory operations. The MF on Tile 0 causes M1 → M2, and the MF on Tile 1 causes M3 → M4. Therefore:
If r1 == 0, we have M2 → M3, so we have M1 → M2 → M3 → M4, so r2 == 1.
If r2 == 0, we have M4 → M1, so we have M3 → M4 → M1 → M2, so r1 == 1.
If r1 == 1, we have M3 → M2, but M4 is not ordered with M1, so r2 == 0 OR r2 == 1.
If r2 == 1, we have M1 → M4, but M2 is not ordered with M3, so r1 == 0 OR r1 == 1.

Listing 2-5. Property P1: Instruction Reordering.
    Tile 0               | Tile 1
    sw [x] = 1   // M1   | lw r2 = [y]  // M4
    MF           // M2   | bbs r5, foo
    sw [y] = 1   // M3   | lw r3 = [x]  // M6

Here, r2 == 1, r3 == 0 is a legal outcome. M6 is dependent on the branch; however, the branch is not dependent on M4. Therefore, there is no dependency between M4 and M6 and they can be reordered. Specifically, M4 may miss in the cache. While the miss is outstanding, the branch and M6 both execute, and M6 hits in the cache, writing 0 to r3. Then, the stores on Tile 0 execute and M4 gets the new value of y (1).

Listing 2-6. Property P2: Store Atomicity and Write Bypassing. Local data dependencies do not establish global visibility ordering: processors can see their own writes early.

    Tile 0               | Tile 1
    sw [x] = 1   // M1   | lw r2 = [y]  // M4
    lw r1 = [x]  // M2   | MF           // M5
    sw [y] = r1  // M3   | lw r3 = [x]  // M6

The following is a legal outcome: r1 == r2 == 1, r3 == 0. In this case, true data dependencies on Tile 0 cause M1, M2, and M3 to execute on Tile 0 in order. However, this does not imply that they become globally visible to Tile 1 in this order. The above outcome could occur if Tile 0 bypassed the sw to x to the lw of x through a write buffer or local cache. Operation M3 then writes memory, and operation M4 observes the write M3, but operation M6 gets to memory before operation M1 has become globally visible. To avoid the local bypass, Tile 0 should issue an MF instruction between M1 and M2. This forces M1 to become globally visible before M3.

Listing 2-7. Property P2: Store Atomicity and Write Bypassing. Local data dependencies establish local ordering.

    Tile 0               | Tile 1
    sw [x] = 1   // M1   | lw r1 = [y]   // M4
    MF           // M2   | lw r2 = [r1]  // M5
    sw [y] = x   // M3   |

r1 == x and r2 == 0 is an illegal outcome. M5 is data dependent on M4 and thus executes (and becomes locally visible) after M4.

Listing 2-8.
Property P2: Store Atomicity and Write Bypassing. Stores have a single order as observed by remote processors.

    Tile 0             | Tile 1              | Tile 2             | Tile 3
    sw [x] = 1  // M1  | lw r1 = [x]  // M2  | sw [y] = 1  // M4  | lw r3 = [y]  // M5
                       | MF                  |                    | MF
                       | lw r2 = [y]  // M3  |                    | lw r4 = [x]  // M6

r1 == 1, r3 == 1, r2 == 0, r4 == 0 is an illegal outcome. If the above outcome were legal, this would imply that Tile 3 observes M4 occurring before M1 and Tile 1 observes M1 occurring before M4. More formally, Tile 1 observes M1 → M2 → M3 → M4, while Tile 3 observes M4 → M5 → M6 → M1. By property P2 of the consistency model, a store from a given processor occurs atomically as observed by remote processors, so the above outcome is illegal.

2.5 TILE-Gx Page Attribute Transitions and Cache Flushes

There are two usage models for page flushing:
• Page attribute transitions, for example changing which cache is the home for a particular page of memory.
• User-managed shared memory.

For TILE-Gx, pages can be flushed either via cache displacement flushes or by targeted page-based flushing. The operation of MF differs on TILE-Gx versus TILE64:
• MF no longer ensures that victims are visible.
• MF no longer ensures that istream operations (demand or prefetch) are visible.

2.6 Protection

This section discusses protection levels, also referred to as access or privilege levels. This topic is introduced at this point because it applies to every Special Purpose Register (SPR) and is critical to using the SPRs.

2.6.1 Levels of Protection

The Tile architecture contains four levels of protection. The protection levels form a strict hierarchy; thus, code or a hardware mechanism executing at one protection level is afforded all of the privileges of that protection level and all lower protection levels.
The protection levels are numbered 0-3, with protection level 0 being the least privileged protection level and 3 being the most privileged protection level. Table 2-15 presents an informal mapping from a protection level number to names. This specification refrains from formally defining names for the four different protection levels (because other protection schemes different from the example used here are possible) but informally defines one possible name mapping.

Table 2-15. Informal Protection-Level Name Mapping

    0   User
    1   Supervisor
    2   Hypervisor
    3   Virtual Machine Level

2.6.2 Protected Resources

The Tile architecture contains several categories of protection mechanisms. These protection mechanisms include:
• Preventing illegal instruction execution
• Preventing instructions from injecting into or reading from selected networks
• Memory protection via multiple translation lookaside buffers (TLBs)
• Negotiated Application Programmer Interfaces (APIs) for physical device multiplexing
• Controlling the protection level to which an interrupt traps

2.7 Interrupt Model

2.7.1 Introduction

Interrupts and exceptions are conditions that cause an unexpected change in control flow of the currently executing code. Interrupts are asynchronous to the program; exceptions are caused directly by execution of an instruction.

Interrupts and exceptions interject themselves into executing code. In the Tile architecture, there is a wider variety of devices than in a standard processor. The interrupt/exception structure of the Tile architecture is distributed and tiled in the same manner as the rest of the design. An interrupt or exception that occurs is only reported to the local tile to which it is relevant. By localizing the reporting, no global structures or communication are needed to process the interrupt.
If a local interrupt needs to be reported to a remote location, it is the operating system's responsibility to communicate that need via one of the architecture's inter-tile communication mechanisms.

The interrupt/exception structure of the Tile architecture is tightly integrated with the protection model. The architecture contains a Minimum Protection Level (MPL) for each possible interrupt or exception that can occur. The MPL is a dual-use mechanism: it indicates the minimum protection level required to take some action in the processor without generating an exception, and it can also indicate the protection level at which the corresponding interrupt or exception handler executes.

Some exceptions occur regardless of protection level. Examples of these are TLB misses and illegal instruction exceptions. For exceptions that occur regardless of protection level, if the current protection level (CPL) is less than the MPL for the corresponding exception, the handler executes at the MPL for the exception. If the CPL is greater than or equal to the MPL for the corresponding exception, then the handler executes at the current protection level. For a complete list of interrupts and exceptions, see Table 2-16, "Interrupt and Exception List."

The Tile architecture uses a vectored approach to interrupts/exceptions; there are four sets of vectors, one for each protection level. On an interrupt or exception, the architecture changes the program counter to a value derived from the interrupt/exception number and the protection level at which the handler is to execute. The new program counter is the value in the INTERRUPT_VECTOR_BASE SPR for the protection level, plus the protection level multiplied by 16 MB (0x01000000), plus the interrupt/exception number multiplied by 256.
This allows 32 Very Long Instruction Word (VLIW) instructions to reside in each vector, and allows all of a protection level's handlers and up to 16 MB of accompanying code to be mapped into virtual address space using one large-page ITLB entry. If more than 32 instructions are needed to handle an interrupt/exception, the code can jump to the rest of the handler located in that same large page, or anywhere else in the address space.

2.7.1.1 Interrupt/Exception State

When an interrupt or exception occurs, location information (identifying where the interrupted program is currently executing) must be saved. Saving it allows the return from the interrupt to resume at the exact location where the interrupt/exception occurred. It also allows the handler to know precisely which instruction caused the exception (if it is caused by a fault in the processor's instruction stream). To track this state, the processor hardware provides a mechanism that saves and restores it atomically.

The state that needs to be saved and restored is the program counter of the interrupted process and the protection level of the interrupted process. The program counter, protection level, and INTERRUPT_CRITICAL_SECTION status bit of the interrupted process are stored in an exceptional context, abbreviated as EX_CONTEXT. The EX_CONTEXT for a given protection level is stored in a pair of Special Purpose Registers (SPRs), with a separate register pair dedicated to each protection level. On an interrupt/exception, the interrupted process's state and protection level are stored in the EX_CONTEXT that corresponds to the protection level at which the handler is to be executed. This state is likewise used to return via the IRET instruction. On a return from interrupt/exception, the content of the current protection level's EX_CONTEXT is copied into the machine's program counter and state.
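The vector address formula from Section 2.7.1 and the EX_CONTEXT save described above can be sketched together as follows. This is a simplified Python model for illustration, not a description of the hardware implementation: it ignores interrupt masking and double faults, and the names `TileState`, `handler_pl`, `vector_address`, and `take_exception` are hypothetical.

```python
from dataclasses import dataclass, field

VECTOR_BYTES = 256            # spacing between vectors, from the text
PL_REGION_BYTES = 0x01000000  # 16 MB per protection level, from the text


def handler_pl(cpl, mpl):
    """An exception handler runs at the MPL when CPL < MPL, else at the CPL."""
    return max(cpl, mpl)


def vector_address(vector_base, pl, number):
    """INTERRUPT_VECTOR_BASE + PL * 16 MB + interrupt number * 256."""
    return vector_base + pl * PL_REGION_BYTES + number * VECTOR_BYTES


@dataclass
class TileState:
    pc: int = 0
    cpl: int = 0
    ics: bool = False
    # Hypothetical per-PL INTERRUPT_VECTOR_BASE values (all zero here).
    vector_base: dict = field(default_factory=lambda: {pl: 0 for pl in range(4)})
    # Maps a protection level to its saved (PC, PL, ICS) triple.
    ex_context: dict = field(default_factory=dict)

    def take_exception(self, number, mpl):
        """Save the interrupted state in the target PL's EX_CONTEXT and vector."""
        target = handler_pl(self.cpl, mpl)
        self.ex_context[target] = (self.pc, self.cpl, self.ics)
        self.pc = vector_address(self.vector_base[target], target, number)
        self.cpl = target
        self.ics = True  # the handler starts inside its critical section
        return target
```

For example, a DTLB miss (interrupt number 18) with an MPL of 1, taken from code at PL 0 with a zero vector base, vectors to 0x01000000 + 18 * 256 = 0x01001200 in this model.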
2.7.1.2 Nested Interrupts/Exceptions

The Tile architecture takes the approach of pushing the bulk of the interrupt/exception processing into software. As a result, an interrupt or exception can occur within an already-executing handler. This introduces a set of problems when an interrupt or exception occurs within a handler at the same protection level. Primarily, the EX_CONTEXT has no place to be saved.

An example illustrates the problem with nested interrupts. Process A is executing at protection level 0 (the lowest protection level). An interrupt of protection level 1 occurs, which saves the program counter and protection level of 0 into the EX_CONTEXT state of protection level 1. Interrupt handler B now begins to execute to resolve the interrupt. While interrupt handler B is executing at protection level 1, it takes some action that causes an exception to occur at protection level 1, the same protection level at which interrupt handler B is executing. The hardware now replaces the EX_CONTEXT with protection level 1 and the program counter of interrupt handler B as it begins to execute exception handler C. Unfortunately, when exception handler C returns, interrupt handler B is returned to, but the content of EX_CONTEXT is not the same as when the control flow left interrupt handler B. In effect, the context information of process A was never saved into reliable storage, and was irrecoverably lost.

To combat this problem, the Tile architecture uses the INTERRUPT_CRITICAL_SECTION status bit. The INTERRUPT_CRITICAL_SECTION status bit indicates whether a process is in the middle of a critical section of a handler, that is, any time the handler is prevented from processing another interrupt or exception.
When an interrupt/exception occurs, the INTERRUPT_CRITICAL_SECTION bit is set, which indicates that the handler cannot be interrupted. If a non-masked interrupt occurs at the same protection level as the current handler while the INTERRUPT_CRITICAL_SECTION bit is set, a double fault interrupt is taken instead. If the MPL of the double fault interrupt is greater than the CPL, the current system state is saved in the target PL's EX_CONTEXT. This is true for any interrupt. However, if the MPL of the double fault interrupt is less than or equal to the CPL, then the double fault handler executes at the CPL, and its EX_CONTEXT is not modified. In either case, the double fault handler will most likely dump the state of the machine and halt. It may also ask a higher level supervisory layer to dump its state and halt the currently running process. For more information on double faults, see "Double Faults" (Section 2.7.3.2).

It is the responsibility of the writer of an interrupt/exception handler to see that no exceptions can occur inside its critical section. One of the key implications of this is that memory accesses while in the critical section of the interrupt handler must not cause a TLB miss. This is only a problem if the TLB miss handler executes at the current protection level (PL). Once the EX_CONTEXT is stored in memory, typically in some kernel stack structure or per-interrupt state, the processor can clear the INTERRUPT_CRITICAL_SECTION bit and handle interrupts/exceptions normally. It is expected that long-running handlers will quickly store away the EX_CONTEXT state and then deassert the INTERRUPT_CRITICAL_SECTION bit so they can use mapped memory and allow other interrupts/exceptions to occur normally. Under normal circumstances, the INTERRUPT_CRITICAL_SECTION bit will be reasserted shortly before execution of IRET so that the handler can safely restore the EX_CONTEXT.
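The save-then-clear discipline described above can be illustrated with a small model of the process A / handler B / handler C example. This is a Python sketch under simplifying assumptions: it tracks a single protection level's EX_CONTEXT and the global ICS bit, it represents a double fault as a raised exception rather than a DOUBLE_FAULT interrupt, and all names (`HandlerModel`, `handler_b`, the PC values) are hypothetical.

```python
class HandlerModel:
    """One protection level's EX_CONTEXT plus the global ICS bit."""

    def __init__(self):
        self.ex_context = None  # (pc, pl, ics) of the interrupted code
        self.ics = False
        self.saved = []         # stands in for a kernel stack in memory

    def enter(self, pc, pl):
        """Interrupt entry at this level: overwrite EX_CONTEXT, set ICS."""
        if self.ics:
            raise RuntimeError("double fault: EX_CONTEXT not yet saved")
        self.ex_context = (pc, pl, self.ics)
        self.ics = True

    def save_context(self):
        """Handler prologue: store EX_CONTEXT to memory, then clear ICS."""
        self.saved.append(self.ex_context)
        self.ics = False

    def restore_context(self):
        """Handler epilogue: set ICS, then reload EX_CONTEXT before IRET."""
        self.ics = True
        self.ex_context = self.saved.pop()

    def iret(self):
        pc, pl, ics = self.ex_context
        self.ics = ics
        return (pc, pl)


def handler_b(save_first):
    """Run handler B with a nested exception C; return where B's IRET lands."""
    tile = HandlerModel()
    tile.enter(pc=0x100, pl=0)       # process A interrupted, B starts
    if save_first:
        tile.save_context()
    else:
        tile.ics = False             # bug: ICS cleared, EX_CONTEXT unsaved
    tile.enter(pc=0x2040, pl=1)      # exception inside B, C starts
    tile.iret()                      # C returns into B
    if save_first:
        tile.restore_context()
    return tile.iret()               # B returns
```

In this model `handler_b(True)` returns to process A at `(0x100, 0)`, while `handler_b(False)` returns to `(0x2040, 1)`, showing that A's context was lost.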
If an interrupt or exception occurs at a higher protection level, regardless of the INTERRUPT_CRITICAL_SECTION status bit, the higher level's EX_CONTEXT is filled in with the current PC, protection level, and value of the INTERRUPT_CRITICAL_SECTION status bit of the interrupted handler. No state information is lost because of the per-protection-level EX_CONTEXTs.

2.7.1.3 Interrupt Traits

The Tile architecture refers to all types of exceptions, traps, and interrupts as interrupts. As described above, interrupts and exceptions are similar, but can have different traits. Exceptions are caused by instructions and are synchronous to the program. Examples are an illegal instruction, or a load that takes a DTLB miss. Interrupts are caused by events outside of the program flow, and are asynchronous. Examples are a timer interval, or an I/O device completion notification (via Inter-Processor interrupt). It is the responsibility of an implementation to ensure timely delivery of asynchronous interrupts, but an implementation may deliver the interrupt at a convenient time as long as it guarantees that the interrupt will not be delayed forever.

When entering INTERRUPT_CRITICAL_SECTION mode on an interrupt or exception, it is desirable to mask some interrupts automatically until the processor exits the critical section. To accomplish this goal, interrupts are automatically masked when the INTERRUPT_CRITICAL_SECTION status bit is set. Note that being in the critical section at a particular protection level does not mask an interrupt from occurring if its MPL is higher than the CPL. This allows a higher level operating system to handle a lower level's fault even within a critical section of a lower level's interrupt handling routine. Interrupts can also be masked by bits set in the INTERRUPT_MASK_X registers.
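The automatic masking rule above reduces to a short predicate. The following Python sketch is an illustration only; it treats the state of the explicit mask registers as an opaque boolean input and applies to asynchronous interrupts. The function names are hypothetical.

```python
def blocked_by_critical_section(mpl, cpl, ics):
    """ICS masks an interrupt only if its MPL does not exceed the CPL."""
    return ics and mpl <= cpl


def can_signal(pending, explicitly_masked, mpl, cpl, ics):
    """True if a pending interrupt may be signaled in this simplified model.

    explicitly_masked summarizes the interrupt mask registers; how that
    value is derived from the registers is outside this sketch.
    """
    if not pending or explicitly_masked:
        return False
    return not blocked_by_critical_section(mpl, cpl, ics)
```

For instance, with the CPL at 1 and ICS set, an interrupt whose MPL is 1 is held off, while one whose MPL is 2 can still be signaled.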
2.7.1.4 Interrupt Masks

The INTERRUPT_MASK_X registers consist of a special purpose register per protection level that controls the masking of the system's interrupts. Each bit in the mask registers corresponds to a particular interrupt. An interrupt is masked off by setting the corresponding bit in the INTERRUPT_MASK_X register to 1, where X is the desired protection level (0 through 3). An interrupt is unmasked if the mask bit is set to 0. When a process is in an interrupt critical section, as denoted by the INTERRUPT_CRITICAL_SECTION status bit, additional interrupts may be masked. If the MPL for the interrupt is less than or equal to the current protection level, then the interrupt is masked when the INTERRUPT_CRITICAL_SECTION status bit is set.

Figure 2-14 presents how the protection level, interrupt masking, and an interrupt critical section come together to signal whether an interrupt occurs. The figure presents the path for a single interrupt, where "I" indicates whether the item is a real interrupt or simply an interrupt number used to convey information about the protection domain (such as WORLD_ACCESS and BOOT_ACCESS). This flow is duplicated for each interrupt. A priority encoder determines the highest priority interrupt occurring on a given cycle. Interrupts that are masked for any reason are delivered when they are unmasked. To clear an interrupt, some action specific to that interrupt must be taken.

2.7.1.5 INTCTRL and Protection of Interrupt Masks

In order to modify the interrupt-related state for protection level X, the executing code's CPL must be at least the MPL of INTCTRL_X. The Tile architecture contains four INTCTRL MPLs that protect the various interrupt-related SPRs. Namely, for protection level X, INTCTRL_X protects the interrupt masking, exceptional context, and system save SPRs. More specifically, INTCTRL_X protects SPRs EX_CONTEXT_X, SYSTEM_SAVE_X_[0,1,2,3], INTERRUPT_MASK_X, INTERRUPT_MASK_SET_X, and INTERRUPT_MASK_RESET_X.
Users can change the protection level needed to access this state to allow the virtualization of this state and facilitate downcall virtualization (refer to "Downcalls," Section 2.7.8). The default configuration of the INTCTRL_X registers sets their MPLs to X itself. This configuration allows the interrupt masks to be accessed at the namesake protection level. Setting the INTCTRL_X MPL to a PL numerically lower than X, while architecturally allowed, does not make much sense, as this would allow lower privileged code to modify the interrupt masks of higher privileged code. Setting the PL higher enables the virtualization of these SPRs.

Figure 2-14: Interrupt Signal (diagram labels: CPL, MPL[i], ICS, CM[i], Interrupt Mask Generation, Interrupt 'i' signaled)

A more detailed view of the interrupt mask generator is shown in Figure 2-15.

Figure 2-15: Interrupt Mask Generator (diagram labels: IM_0[i], IM_1[i], IM_2[i], IM_3[i], CPL, Interrupt Mask for interrupt 'i')

2.7.1.6 VLIW and Interrupts

The Tilera Tile architecture is a VLIW architecture. A complete VLIW instruction bundle is atomically executed or not executed.

2.7.2 Interrupt and Exception List

In Table 2-16, the lowest number indicated in the Interrupt Number field has the highest priority.

Table 2-16. Interrupt and Exception List

Interrupt                                     Interrupt or
Number     Name                               Exception?    Short Name
0          Memory Error                       I             MEM_ERROR
1          Single Step 3                      E             SINGLE_STEP_3
2          Single Step 2                      E             SINGLE_STEP_2
3          Single Step 1                      E             SINGLE_STEP_1
4          Single Step 0                      E             SINGLE_STEP_0
5          IDN Complete                       E             IDN_COMPLETE
6          UDN Complete                       E             UDN_COMPLETE
7          ITLB Miss                          E             ITLB_MISS
8          Illegal Instruction                E             ILL
9          General Protection Violation       E             GPV
10         IDN Access                         E             IDN_ACCESS
11         UDN Access                         E             UDN_ACCESS
12         Software Interrupt 3               E             SWINT_3
13         Software Interrupt 2               E             SWINT_2
14         Software Interrupt 1               E             SWINT_1
15         Software Interrupt 0               E             SWINT_0
16         Illegal Translation                E             ILL_TRANS
17         Unaligned Data                     E             UNALIGN_DATA
18         DTLB Miss                          E             DTLB_MISS
19         DTLB Access Error                  E             DTLB_ACCESS
20         IDN Firewall Violation             I             IDN_FIREWALL
21         UDN Firewall Violation             I             UDN_FIREWALL
22         Tile Timer                         I             TILE_TIMER
23         Auxiliary Tile Timer               I             AUX_TILE_TIMER
24         IDN Timer                          I             IDN_TIMER
25         UDN Timer                          I             UDN_TIMER
26         IDN Available                      I             IDN_AVAIL
27         UDN Available                      I             UDN_AVAIL
28         Interprocessor Interrupt 3         I             IPI_3
29         Interprocessor Interrupt 2         I             IPI_2
30         Interprocessor Interrupt 1         I             IPI_1
31         Interprocessor Interrupt 0         I             IPI_0
32         Performance Counters               I             PERF_COUNT
33         Auxiliary Performance Counters     I             AUX_PERF_COUNT
34         Interrupt Control 3                I             INTCTRL_3
35         Interrupt Control 2                I             INTCTRL_2
36         Interrupt Control 1                I             INTCTRL_1
37         Interrupt Control 0                I             INTCTRL_0
38         Boot Access                                      BOOT_ACCESS
39         World Access                       I             WORLD_ACCESS
40         Instruction ASID                   I             I_ASID
41         Data ASID                          I             D_ASID
42         Double Fault                       I             DOUBLE_FAULT

2.7.3 Interrupt State, Control Registers, Double Faults, and IRET

2.7.3.1 Interrupt State and Control Registers

The interrupt state registers are mapped, that is, they maintain specific addresses as part of the architecture's special purpose register space.

EX_CONTEXT
The EX_CONTEXT is vital to the interrupt process.
EX_CONTEXT is an abbreviation for exceptional context. An EX_CONTEXT is provided for each of the architecture's four protection levels. Each EX_CONTEXT consists of two Special Purpose Registers. The EX_CONTEXT registers are named EX_CONTEXT_0_0, EX_CONTEXT_0_1, EX_CONTEXT_1_0, EX_CONTEXT_1_1, EX_CONTEXT_2_0, EX_CONTEXT_2_1, EX_CONTEXT_3_0, and EX_CONTEXT_3_1. The first appended number corresponds to the protection level and the second number identifies whether it is the first or second word of the EX_CONTEXT. The first word of an EX_CONTEXT, EX_CONTEXT_X_0, contains the exceptional program counter (PC). The second word of an EX_CONTEXT, EX_CONTEXT_X_1, contains the protection level (PL) and the exceptional INTERRUPT_CRITICAL_SECTION status bit. The diagrams in Figure 2-16 through Figure 2-23 show the bit locations of an EX_CONTEXT.

Interrupt Mask Register
The INTERRUPT_MASK_X registers allow a program to mask out interrupts. X indicates the Protection Level number. This set of registers includes: INTERRUPT_MASK_0, INTERRUPT_MASK_1, INTERRUPT_MASK_2, and INTERRUPT_MASK_3. See "Interrupt Masks" (Section 2.7.1.4) for more details on the control of the Interrupt Mask Register.

Interrupt Critical Section Status Register
The Tile architecture contains a single global INTERRUPT_CRITICAL_SECTION status bit [Bit 0] in the INTERRUPT_CRITICAL_SECTION register (see Figure 2-26). There are no restrictions on changing this bit. This bit is active high and indicates whether the system is in an interrupt critical section (Bit 0 is set to 1). Further, this status register controls whether or not data is copied into an EX_CONTEXT when an interrupt is signaled. This status bit is set when an interrupt or exception occurs. When a valid IRET is issued, the state found in EX_CONTEXT_CPL_1.ICS is copied to the INTERRUPT_CRITICAL_SECTION register.
Last Interrupt Reason Register
Double fault handlers need to determine what happened when a DOUBLE_FAULT interrupt occurs. In order to make this determination, the TILE architecture supports an SPR that contains the last two interrupt/exception reasons. The last two interrupt reasons are saved in a single LAST_INTERRUPT_REASON SPR (Figure 2-27). The LAST_INTERRUPT_REASON register is shifted over whenever an interrupt occurs, with the new interrupt reason being shifted in. On a double fault, instead of a double fault being registered in the LAST_INTERRUPT_REASON register, the highest priority interrupting reason that caused the double fault is stored. On reset, the LAST_INTERRUPT_REASON register is reset to the double fault interrupt number. This register is protected by the DOUBLE_FAULT MPL to prevent unauthorized access. For more information on double faults, see "Double Faults" (Section 2.7.3.2).

Note that the following are sample register diagrams. For a complete list of register descriptions, see Table 8-1, "Special Purpose Registers," on page 106.
Figure 2-16: EX_CONTEXT_0_0 Register Diagram (fields: PC, Reserved)
Figure 2-17: EX_CONTEXT_0_1 Register Diagram (fields: PL, ICS, Reserved)
Figure 2-18: EX_CONTEXT_1_0 Register Diagram (fields: PC, Reserved)
Figure 2-19: EX_CONTEXT_1_1 Register Diagram (fields: PL, ICS, Reserved)
Figure 2-20: EX_CONTEXT_2_0 Register Diagram (fields: PC, Reserved)
Figure 2-21: EX_CONTEXT_2_1 Register Diagram (fields: PL, ICS, Reserved)
Figure 2-22: EX_CONTEXT_3_0 Register Diagram (fields: PC, Reserved)
Figure 2-23: EX_CONTEXT_3_1 Register Diagram (fields: PL, ICS, Reserved)
Figure 2-24: INTERRUPT_MASK_0 Register Diagram (register INTERRUPT_MASK_X_0; fields: MASK_ bits, Reserved)
Figure 2-25: INTERRUPT_MASK_X_1 Register Diagram (fields: MASK_ bits, Reserved)
Figure 2-26: INTERRUPT_CRITICAL_SECTION State Register Diagram (fields: ICS, Reserved)

2.7.3.2 Double Faults

A double fault occurs when an unmasked interrupt or exception occurs at the current protection level while the INTERRUPT_CRITICAL_SECTION status bit is set. This is, in effect, an interrupt inside of an interrupt handler critical section.
Because the current protection level's EX_CONTEXT state has not been saved to memory at this point, the EX_CONTEXT will not be overwritten. Double faults are typically due to programmer error, as critical sections of interrupt handlers should not take an interrupt or exception. Note that interrupts are implicitly masked when the INTERRUPT_CRITICAL_SECTION status bit is set.

Figure 2-27: LAST_INTERRUPT_REASON State Register Diagram (fields: LAST_REASON, Reserved, LAST_LAST_REASON, Reserved)

When a double fault occurs, the LAST_INTERRUPT_REASON register is shifted eight bits to the left and filled in with the highest priority interrupt/exception causing the double fault. The last two interrupt/exception reasons are tracked. The double fault interrupt handler can inspect the LAST_INTERRUPT_REASON register to determine if it is possible to recover from the double fault and to provide debugging information about where the error occurred.

2.7.3.3 IRET

The IRET instruction is used to signal the end of an interrupt/exception routine. The IRET instruction atomically copies the exception context state to the program counter (PC), current protection level (CPL), and the INTERRUPT_CRITICAL_SECTION status bit.

An IRET atomically takes the following actions. First, it verifies that the protection level that the IRET will be returning to, which is stored in EX_CONTEXT_CPL_1.PL, is less than or equal to the current protection level. If not, an IRET_ERROR general protection violation occurs. Next, the machine copies the program counter (PC) from the current protection level's EX_CONTEXT_CPL_0 into the machine's PC. Next, the machine copies the current protection level's EX_CONTEXT_CPL_1.ICS status bit into the global INTERRUPT_CRITICAL_SECTION status bit.
Lastly, the current protection level (CPL) is updated to be EX_CONTEXT_CPL_1.PL, hence restoring the protection level.

2.7.4 Interprocessor Interrupt (IPI)

I/O device and Tile-to-Tile interrupts are delivered to Tile software via the Tile Interprocessor Interrupt (IPI) mechanism. Each Tile has four IPI MPLs, each with 32 interrupt events. I/O interrupts have programmable bindings in their MMIO register space, which specify the target Tile, interrupt number (also referred to as the IPI Minimum Protection Level or IPI MPL), and event number. System software can choose to share Tile interrupt event bits among multiple I/O devices or dedicate the interrupt bits to a single I/O interrupt. Interrupt bits can also be shared between I/O and Tile-to-Tile interrupts. I/O devices implement interrupt status and enable bits to allow interrupt sharing and coalescing.

2.7.5 Distributed Interrupt Processing

The interrupt model on the Tile architecture is a distributed model. Any interrupt that gets signaled only occurs on a single tile. Each tile may receive all of the interrupts laid out in this section. The architecture only provides for local interrupt delivery; thus, if an operating system would like to deliver an interrupt to a tile on which the interrupt did not occur, some form of software tile-to-tile communication is needed. This communication can be through a Tile-to-Tile IPI or through memory.

One example implementation is a system in which all of the system code executes on one tile. In effect this is the system tile. Suppose an interrupt occurs on a tile that is not the system tile. In this example, the designer wants to slim down the system code that is running on the slave (non-system) tiles. Unfortunately, there must still be at least a system stub running on the slave tile.
This stub handles the interrupt and communicates the fact that an interrupt occurred on the slave tile to the system tile. After this communication, the slave tile waits for further instruction from the system tile. This in effect allows a parallel processing model with one centralized system copy.

2.7.6 Proxying Interrupts

The ring or hierarchical protection model that the Tile architecture provides can be too restrictive for some protection systems. To allow for more flexible protection systems, an interrupt that was delivered to a high protection level may need to be proxied down to a lower protection level's handler. This is termed a proxied downcall. To accomplish a proxied downcall, the higher level process copies the EX_CONTEXT state from the current protection level to the EX_CONTEXT for the protection level that will be receiving the downcall. Next, it writes the EX_CONTEXT_CPL state with the protection level of the downcall and the PC of the downcall's interrupt vector location, and sets the INTERRUPT_CRITICAL_SECTION status bit in the EX_CONTEXT_CPL state. Lastly, it executes an IRET, thus mimicking an interrupt entry to the downcalled interrupt vector.

2.7.7 Lower Protection Level Interrupts

Asynchronous interrupts may happen at any time. They may even occur inside of code running at a higher protection level than the level that is desired to handle the particular interrupt. The architecture provides two ways of dealing with these interrupts. First, if the higher level system code does not want to be interrupted by the lower protection level's interrupt, it may simply mask the interrupt. The interrupt will be delivered when it is unmasked by restoring the interrupt mask for the lower protection level.
Second, if the higher protection level code would like to be interrupted by a particular interrupt, it should leave the interrupt unmasked. When the interrupt arrives, the corresponding interrupt handler for the higher level system code is executed. It is the responsibility of this code to proxy the call, if needed, to the lower level operating system. The INTCTRL interrupt provides a convenient mechanism to deliver these interrupts when the higher protection level code completes.

2.7.8 Downcalls

While increasing the CPL is the most common way to request a service, in some situations you might want to decrease the CPL to accomplish a task instead. In effect, this delegates processing of an interrupt to code running at a lower protection level. For instance, on the Tile architecture, a hypervisor might want to handle the Double Fault interrupt to detect a faulty supervisor; however, if that interrupt were generated by an application program, the hypervisor might want to allow the supervisor to handle it instead.

In many cases, this is quite easy to do. If an interrupt occurs at CPL 0 that is handled at CPL 2, and the interrupt service routine then decides that it should be handled at CPL 1 instead, it performs the following algorithm, which we will term a downcall:

1. First, it reads the contents of exception context 2 and writes them to exception context 1; this will be the PC of the interrupted code, a CPL of 0, and whatever the Interrupt Critical Section (ICS) state was at the time of the interrupt.
2. Next, it writes exception context 2 with a PC of the desired interrupt handler in PL 1, a CPL of 1, and an ICS state of true.
3. Lastly, it executes an IRET instruction.

After the return from interrupt, the PL 1 interrupt service routine is executed, and it begins with exactly the same state it would have had if the interrupt had gone to it originally; when it is done, it returns from the interrupt and the originally executing code is resumed.
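The three downcall steps above can be illustrated with a small host-side model. This is only a sketch: the types ex_context_t and tile_model_t and the functions model_iret and model_downcall are hypothetical names introduced for illustration, and the model assumes that IRET simply restores the PC, CPL, and ICS state from the exception context of the level executing it. It does not use real SPR accesses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one exception context: the state IRET restores. */
typedef struct {
    uint64_t pc;   /* PC to resume at */
    unsigned cpl;  /* protection level to resume at */
    bool     ics;  /* Interrupt Critical Section state to resume with */
} ex_context_t;

/* Hypothetical model of the per-tile state involved in a downcall. */
typedef struct {
    ex_context_t ex_context[4]; /* one exception context per PL 0..3 */
    uint64_t pc;                /* current PC */
    unsigned cpl;               /* current protection level */
    bool     ics;               /* current ICS state */
} tile_model_t;

/* Model of IRET executed at protection level pl: resume from ex_context[pl]. */
static void model_iret(tile_model_t *t, unsigned pl)
{
    assert(pl < 4);
    t->pc  = t->ex_context[pl].pc;
    t->cpl = t->ex_context[pl].cpl;
    t->ics = t->ex_context[pl].ics;
}

/* Downcall from the handler running at from_pl to a handler at the lower to_pl. */
static void model_downcall(tile_model_t *t, unsigned from_pl, unsigned to_pl,
                           uint64_t handler_pc)
{
    assert(from_pl < 4 && to_pl < from_pl);

    /* Step 1: hand the interrupted code's context to the lower level. */
    t->ex_context[to_pl] = t->ex_context[from_pl];

    /* Step 2: point this level's context at the lower-level handler. */
    t->ex_context[from_pl].pc  = handler_pc;
    t->ex_context[from_pl].cpl = to_pl;
    t->ex_context[from_pl].ics = true;

    /* Step 3: IRET "returns" into the lower-level handler. */
    model_iret(t, from_pl);
}
```

In this model, calling model_downcall(&t, 2, 1, handler_pc) leaves the tile at PL 1 in the handler with ICS set, and a later model_iret(&t, 1) resumes the originally interrupted code, which mirrors the behavior described in the paragraph above.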
If you want to delegate handling of an interrupt to the PL at which the interrupted code was running, it can be more difficult. For instance, a supervisor might get an I/O interrupt that it would like to delegate to an application program, since the specific I/O device is owned by that application. If the interrupted code was not in an interrupt-critical section, and the interrupt of interest is not masked at the delegatee PL, returning to the lower-level interrupt routine can be done as described above. However, if the interrupted code was in an ICS, that procedure would destroy the state in the lower PL's exception context; and if the to-be-delegated interrupt is masked, that procedure could destroy application state that is being protected by that mask.

There are two ways to deal with this problem. One way is to virtualize the ICS state bit and the interrupt masks for the lower PL by raising the MPL that controls access to their associated special-purpose registers. The delegating code then returns to the previously executing code at the lower PL. When that code accesses the ICS or interrupt mask SPRs, the delegating code's GPV interrupt handler can emulate the appropriate instruction. When one of those accesses clears the ICS bit or unmasks the relevant interrupt, the GPV handler can, after emulating the instruction, reset the MPL to its original value and then downcall to the delegated interrupt routine.

Another method requires the cooperation of the delegated-to code and only works when the delegatee is in an interrupt critical section, but is somewhat easier to implement. With this method, the delegator arranges for the delegatee to get a notification, via an interrupt, indicating that it should make a special service request to the delegator.
Since that notification comes via an interrupt, it will not be delivered until the delegatee exits the critical section. The delegatee's interrupt routine is then executed, and it makes the special service request to the higher PL. That PL performs the downcall, but does not modify the delegatee PL's exception context; when the delegatee interrupt routine exits, it returns to the code that was interrupted by the notification interrupt.

To enable this second method, the Tile architecture provides four software-triggerable interrupts, each of which can be targeted at any PL by setting an associated MPL register. Each interrupt is asserted when a corresponding special purpose register is written with a 1. These interrupts are the Interrupt Control [0:3] interrupts. For these interrupts to fire, the low bit of the corresponding INTCTRL_X_STATUS register must be set. Setting this bit allows software to control when an interrupt fires. Refer to the descriptions of the INTCTRL_X_STATUS registers that follow.

Interrupt Control 0 Status Register (INTCTRL_0_STATUS)

This register is used to specify the interrupt control 0 interrupt.

Speed: Slow
Minimum Protection Level: INTCTRL_0

Figure 2-28: INTCTRL_0_STATUS Register Diagram (fields: Reserved, INTCTRL_0_STATUS)

Table 2-17. INTCTRL_0_STATUS Register Bit Descriptions

Bits   Name               Reset   Description
63:1   Reserved           0       Reserved
0      INTCTRL_0_STATUS   0       This field specifies the interrupt control 0 interrupt.

Interrupt Control 1 Status Register (INTCTRL_1_STATUS)

This register enables the interrupt control 1 interrupt.

Speed: Slow
Minimum Protection Level: INTCTRL_1

Figure 2-29: INTCTRL_1_STATUS Register Diagram (fields: Reserved, INTCTRL_1_STATUS)

Table 2-18. INTCTRL_1_STATUS Register Bit Descriptions

Bits   Name               Reset   Description
63:1   Reserved           0       Reserved
0      INTCTRL_1_STATUS   0       This register enables the interrupt control 1 interrupt.

Interrupt Control 2 Status Register (INTCTRL_2_STATUS)

This register enables the interrupt control 2 interrupt.

Speed: Slow
Minimum Protection Level: INTCTRL_2

Figure 2-30: INTCTRL_2_STATUS Register Diagram (fields: Reserved, INTCTRL_2_STATUS)

Table 2-19. INTCTRL_2_STATUS Register Bit Descriptions

Bits   Name               Reset   Description
63:1   Reserved           0       Reserved
0      INTCTRL_2_STATUS   0       This register enables the interrupt control 2 interrupt.

Interrupt Control 3 Status Register (INTCTRL_3_STATUS)

This register enables the interrupt control 3 interrupt.

Speed: Slow
Minimum Protection Level: INTCTRL_3

Figure 2-31: INTCTRL_3_STATUS Register Diagram (fields: Reserved, INTCTRL_3_STATUS)

Table 2-20. INTCTRL_3_STATUS Register Bit Descriptions

Bits   Name               Reset   Description
63:1   Reserved           0       Reserved
0      INTCTRL_3_STATUS   0       This register enables the interrupt control 3 interrupt.

2.8 Software-Visible Dynamic Networks

2.8.1 Overview

The dynamic networks provide packet-based communication between Tiles, I/O devices, and memory. The Tile Architecture™ provides two dynamic networks for direct software access: the User Dynamic Network (UDN) and the I/O Dynamic Network (IDN). The UDN is typically used for application-level communication, while the IDN is typically used for operating system, I/O, and hypervisor communications.

A specific implementation of the Tile Architecture may employ additional dynamic networks for hardware-based communication between Tiles, I/Os, and/or memory. Cache coherency and memory operations, for example, can use dedicated dynamic networks. The hardware usage of implementation-specific dynamic networks is not defined in this document.
2.8.1.1 Register Mapping and Interlock

The UDN and IDN are directly accessible by the Tile's Arithmetic Logic Unit (ALU). The networks are register mapped, making them highly integrated with the program flow. This design provides low-latency, low-overhead access for network reads and writes. For example:

• {add udn0, r5, r6} // Adds the contents of r5 to r6 and sends the result to the UDN.
• {add r5, r6, udn0} // Reads a word from the UDN, adds it to r6, and puts the result in r5.
• foo = udn0_receive(); // This C intrinsic dequeues a single word of data from the UDN and stores it in the variable foo.

Access to the UDN and IDN is fully interlocked. This allows an application to read the network port and go to sleep automatically until data arrives, which provides a low-power state with zero-latency wake-up. Similarly, on a network send, if the network cannot consume the packet word immediately, the processor automatically waits until buffer space is available, which saves considerable latency compared with a polling or interrupt-driven scheme.

2.8.1.2 Routing

The dynamic networks are two-dimensional meshes. The sending software prepends a route header to each packet. The route header contains the X and Y location information (along the X and Y axes) of the target Tile or I/O device. Figure 2-32 shows the location information for each tile in a processor. A route decision, based on a comparison of the X and Y coordinates in the packet's route header to the X and Y coordinates of the Tile, is made at each switch point as the packet travels from the source node to the destination node. The Tiles' X and Y coordinates are stored based on each Tile's position in the mesh.
Figure 2-32: Tile Processor Hardware Architecture (block diagram: Tiles labeled with X,Y coordinates 0,0 through 5,5, together with the memory controllers and I/O devices)

The routing algorithm is as follows. The X dimension is checked first:

• If the x value is less than the SPR value (see note 1), send the packet west. At tile 1,0, the tile west of it is 0,0, and so on.
• If the x value is greater than the SPR value, send the packet east.
• If the x value and the SPR value are equal, check Y.

Then the Y dimension is checked as follows:

• If the packet destination is less than the SPR value, send the packet north.
• If the packet destination is greater than the SPR value, send the packet south.
• If the packet destination and SPR values are equal, send the packet to the tile demux logic.

The routing order can be changed to route the Y dimension first, followed by the X dimension, by writing the MEM_ROUTE_ORDER SPR.

Note 1: Special Purpose Registers can be either read or write registers. If they are read only, the SPR has a value that is stored in the register.

2.8.1.3 Demultiplexing

Each packet sent to a Tile via the IDN or UDN contains an ID that is used by hardware to demultiplex multiple flows at the receiver. The UDN provides four hardware-demultiplexed flows and the IDN provides two. Individual demultiplexed flows may be accessed directly using named registers: udn0, udn1, udn2, udn3, idn0, and idn1. Hardware removes the route header word at the receiver; consequently, software sees only the packet data at the named registers. For more information about the UDNx registers, refer to "Register Set" in the TILE-Gx Instruction Set Architecture Specification (UG401).

Figure 2-33: Demux (block diagram: UDN network interface with N, S, E, and W ports; tag compare against UDN_ID_0 through UDN_ID_3; demux buffer; SU0 through SU3; main processor interface)

2.8.1.4 Receive-Side Buffering

To prevent head-of-line blocking, per-flow buffer space is provided at the receiving tile. This allows packets to be dequeued in a different order than they were received at the switch point. The depth of the buffer varies by implementation. By storing independent flows in separate addressable queues, software can consume packet flows out of order relative to their arrival at the switch without causing head-of-line blocking.

This undifferentiated buffer has programmable high-watermarks for UDN and IDN traffic. These watermarks provide a hard partition of the buffer between UDN and IDN flows when deterministic, non-blocking performance is required on the IDN or UDN.

2.8.2 Ordering

Packets between any two nodes that use the same ID are delivered in order. Packets from different nodes or using different IDs are not ordered with respect to each other. Packets are never interleaved at the destination for a given flow.
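The dimension-ordered routing rule from Section 2.8.1.2 can be sketched as a per-switch decision function. This is a host-side illustration, not hardware or driver code: the names route_dir_t, route_step, and route_hops are hypothetical, and the y_first flag is only a stand-in for the effect of changing the routing order (the actual encoding of the MEM_ROUTE_ORDER SPR is not modeled). The sketch assumes that X decreases to the west and Y decreases to the north, as the routing rules imply.

```c
#include <assert.h>

typedef enum {
    ROUTE_WEST, ROUTE_EAST, ROUTE_NORTH, ROUTE_SOUTH, ROUTE_LOCAL
} route_dir_t;

/* Next-hop decision at the switch of tile (tile_x, tile_y) for a packet
 * whose route header names (dest_x, dest_y). y_first = 0 checks X first. */
static route_dir_t route_step(unsigned dest_x, unsigned dest_y,
                              unsigned tile_x, unsigned tile_y, int y_first)
{
    if (!y_first) {
        if (dest_x < tile_x) return ROUTE_WEST;
        if (dest_x > tile_x) return ROUTE_EAST;
    }
    if (dest_y < tile_y) return ROUTE_NORTH;
    if (dest_y > tile_y) return ROUTE_SOUTH;
    /* Reached only when Y matches (and, for X-first, X already matches). */
    if (dest_x < tile_x) return ROUTE_WEST;
    if (dest_x > tile_x) return ROUTE_EAST;
    return ROUTE_LOCAL; /* deliver to the tile demux logic */
}

/* Count the switch-to-switch hops a packet takes from source to destination. */
static unsigned route_hops(unsigned src_x, unsigned src_y,
                           unsigned dest_x, unsigned dest_y, int y_first)
{
    unsigned x = src_x, y = src_y, hops = 0;
    for (;;) {
        switch (route_step(dest_x, dest_y, x, y, y_first)) {
        case ROUTE_WEST:  x--; break;
        case ROUTE_EAST:  x++; break;
        case ROUTE_NORTH: y--; break;
        case ROUTE_SOUTH: y++; break;
        case ROUTE_LOCAL:
            assert(x == dest_x && y == dest_y);
            return hops;
        }
        hops++;
    }
}
```

For example, route_step(0, 0, 1, 0, 0) selects west, matching the statement that the tile west of 1,0 is 0,0. Changing the routing order changes the path a packet takes but not the number of hops.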
2.8.2.1 Packet Format

Packets on the dynamic network use the following format:

Figure 2-34: Dynamic Network Packet Format (bit positions 63 through 0; Word 0: Header with ID, Dest_X, Dest_Y, and Length fields; ID (present if F=1); Word 1 onward: Data, 1-128 words)

Table 2-21. Field Descriptions

Word, Bits     Name       Description
Word0, 6:0     Length     Number of 32-bit words in the packet, not including the route header. 0 indicates a 128-word packet.
Word0, 17:7    Dest_Y     Y coordinate of destination Tile or I/O device.
Word0, 18:9    Dest_X     X coordinate of destination Tile or I/O device.
Word0, 63:30   Reserved   Must be zero.
Word1-128      Data       1 to 128 words of packet payload.

Hardware in the switch points uses the route header (word0) to route the packet from source to destination.

2.8.3 Network Hardwall

The switches for the UDN and IDN provide an optional hardwall for independent communication domains as well as virtualization. When an output port is protected via a bit in the U/IDN_DIRECTION_PROTECT SPR, no data will be sent out of the associated port. An interrupt allows software to detect any attempt to send traffic to a protected port.

Software can handle a hardwall protection violation as follows:

1. Read the U/IDN_DIRECTION_PROTECT SPR to determine which output port(s) generated the violation.
2. If multiple output ports are detecting a violation, choose one to process and use that output port's U/IDN_SP_STATE.OP_MUX_SEL indicator to determine the source port. Otherwise, if only one output port is in violation, the source port indicator can be used to determine which input port needs to be extracted.
3. Read the packet from the offending input port using the FIFO "spill" SPRs. The packet length must be interpreted so that exactly the entire packet is eventually extracted (though this may require multiple trips through the ISR).
4. If an entire packet is not available, return from the handler; a trap occurs again when subsequent words arrive.
5. If an entire packet is sent, clear the locked indicator on the output port.

2.8.4 Interrupts

Interrupts are provided for the following conditions on the UDN and IDN:

• A data word is available on udn0/1/2/3, idn0/1 (individually maskable).
• Data is available on the catch-all queue.
• Data is sent on the UDN/IDN ports. This interrupt is triggered after the last word has been sent.
• A switchpoint hardwall violation occurs (an attempt was made to send a data word to a protected output port).

2.8.5 Deadlocks

The dynamic networks provide deadlock-free routing between nodes for a single traffic flow in each direction. However, software can induce a deadlock at the protocol level if circular dependencies are created between sending and receiving data.

Because dynamic network switch points are locked once a route header has traversed the switch (packets are never interleaved), care must be taken to prevent deadlock due to a node sending a partial packet.

2.9 Special Purpose Registers (SPRs)

The Tile Processor contains special purpose registers (SPRs) that are used for several purposes:

• Hold state information and provide locations for storing data that is not in the general purpose register file or memory
• Provide access to structures such as TLBs
• Control and monitor interrupts
• Configure and monitor hardware features, for example prefetching, iMesh routing options, etc.
SPRs can be read and written by tile software (via the mfspr and mtspr instructions, respectively), and in some cases are updated by hardware. The SPRs are grouped by function into protection domains, each of which can be set to an access protection level, called the minimum protection level (MPL) for that protection domain. "Protection" on page 29 defines how the MPLs are used. Software that attempts to access an SPR for which it does not have the appropriate privilege level causes a General Protection Violation (GPV) interrupt, and information is logged in the GPV_REASON SPR, as described in "General Protection Violation Reason Register (GPV_REASON)" on page 25.

2.10 Performance Counters / System Diagnostics

2.10.1 In-Tile System Devices

Each Tile provides a number of system services, diagnostics, and performance monitoring capabilities to support a complete system within the Tile.

2.10.1.1 Tile Timer and AUX_TILE_TIMER

Two 32-bit down counters with interrupts are provided in the Tile. The timers can be used for operating system level "tick" functionality or for any other timing task. The interrupt uses the TILE_TIMER minimum protection level (MPL). The timer is located in the COUNT field of the TILE_TIMER_CONTROL SPR. This counter can be disabled and can be saved and restored during context swapping. The UNDERFLOW bit of the TILE_TIMER_CONTROL SPR indicates that the counter has wrapped from 0 to (2^32)-1.

2.10.1.2 Cycle Counter

The Tile architecture™ provides a 64-bit free-running cycle counter. The counter initializes to 0. Read access to the counter is provided by the CYCLE SPR, which is part of the WORLD_ACCESS MPL. Software can also write the counter via the CYCLE_MODIFY SPR, which is part of the BOOT_ACCESS MPL.

2.10.2 Events

The performance monitoring and system debug capabilities of the Tile architecture rely on implementation-defined events.
The specific set of events available to software varies depending on implementation, but examples include cache miss, instruction bundle retired, network word sent, and so on. Events are used to increment performance counters or to interact with system debug/diagnostics functionality.

2.10.3 Counters

The Tile architecture provides four 32-bit performance counters. The counters may be assigned to any one of the implementation-specific events or to the other performance counter, providing 64-bit counters if needed. On overflow, the counter triggers an interrupt at the PERF_COUNT MPL. Performance counters are controlled and monitored via the PERF_COUNT_0/1, PERF_COUNT_CTL, PERF_COUNT_STS, AUX_PERF_COUNT_0/1, AUX_PERF_COUNT_CTL, and AUX_PERF_COUNT_STS SPRs. A current protection level (CPL) mask is provided in the PERF_COUNT_CTL SPR to prevent the associated performance counter from running at specific protection levels.

2.10.4 Watch Registers

The Tile architecture provides programmable watch registers to track matches to implementation-specific multi-bit fields, for example a match on a specific fetch PC or on a specific speculative memory reference virtual address (VA). The match comparison uses the 64-bit WATCH_MASK SPR to qualify which bits participate in the comparison. The result of the watch comparison is an event that can be selected by the performance counters. Hence an interrupt can be raised on the Nth occurrence of the watch match comparison by preloading a performance counter with (2^32)-N and selecting the SPCL.WATCH event in the performance counter. The watch register can be used to trigger a performance counter interrupt by preloading the counter with ((2^32)-1)-N, where N is the number of matches at which an interrupt is desired.
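The text above gives two preload expressions, (2^32)-N and ((2^32)-1)-N. The arithmetic behind them can be checked with a short sketch. It assumes, as a labeled assumption rather than a documented fact, that the overflow interrupt is raised when a 32-bit up-counter wraps from (2^32)-1 to 0; the function names events_until_wrap and preload_for_nth_event are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of counted events needed for a 32-bit up-counter that starts at
 * `preload` to wrap from (2^32)-1 to 0 (assumed overflow condition). */
static uint64_t events_until_wrap(uint32_t preload)
{
    return (UINT64_C(1) << 32) - preload;
}

/* Preload value (2^32)-N, so that the wrap happens on the Nth event. */
static uint32_t preload_for_nth_event(uint32_t n)
{
    assert(n >= 1);
    return (uint32_t)((UINT64_C(1) << 32) - n);
}
```

Under this assumption, a preload of (2^32)-N wraps on the Nth event, while a preload of ((2^32)-1)-N wraps one event later. Which expression applies therefore depends on whether the hardware signals overflow on the wrap itself or on reaching the all-ones value, which this sketch does not establish.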
Note that the interrupt is not precise; hence it is not possible to trap the exact instruction that caused the match. Precise trapping based on VA or PC is generally provided by software debugging services such as GDB, which rely on TLB protection attributes and instruction emulation to set breakpoints on specific VAs or PCs.

2.10.5 Pass SPR

The Tile architecture provides a 64-bit PASS SPR that acts as a general purpose scratchpad. This SPR is typically used for higher level signaling in Tile simulators. Two additional aliases to the same PASS SPR are provided via the DONE and FAIL SPRs. On the TILE-Gx device, writes to the PASS, FAIL, and DONE SPRs are specific events that can be fed into the performance counters or the diagnostics functions.

2.10.6 Broadcast Networks

TILE-Gx provides four dedicated 1-bit broadcast networks intended for use with performance monitoring and diagnostics. Each network consists of a single input wire and a single output wire on each compass point. When the input wire asserts, the Tile asserts all four of the output wires in the next cycle. Tiles assert their output wires when the event selected by the DIAG_BCST_CTL SPR occurs. Additionally, software may assert a Tile's broadcast network by writing to the DIAG_BCST_TRIGGER SPR.

The assertion of a broadcast network is an event that may be fed into the performance counters for a given Tile. Using this mechanism, up to four Tile events can be fed to other Tiles, allowing six effective performance counters for a single Tile. These counters also provide a low-latency interrupt mechanism between Tiles. Note that it is not recommended to use this mechanism in implementation-independent code, since the behavior is not defined by the Tile architecture.
The broadcast networks provide a hardwall on each output port in the DIAG_BCST_MASK SPR. The hardwall can be used to prevent cycles in the broadcast networks and to partition the Tile fabric into independent domains.

The broadcast networks are additionally visible and controllable in the Rshim (see note 1) via the DIAG_BROADCAST on channel-0.

Note 1: Refer to the "Glossary" on page 703 for a definition of Rshim.

2.10.7 System Software Debug

Debugging of applications software typically uses industry standard methods such as the GNU debugger (GDB). However, when debugging low level hypervisor or system software, it is sometimes necessary to extract processor state data without the assistance of cooperating debugger software running on the tile.

2.10.7.1 Tile Debug Port

To aid with system software debugging, TILE-Gx provides access to essential processor state data such as fetch-PC, registers, and specific SPRs. This Tile state is readable from any of the following sources:

• JTAG
• USB
• UART (in protocol mode; see "Protocol Mode" on page 211)
• I2C Slave
• Software running on another Tile
• PCIe

Access to Tile state is via a JTAG instruction-like interface that can be addressed directly from JTAG or via a JTAG control interface located in the Rshim (see Section 15.5 Rshim JTAG for more information on the JTAG_CONTROL, JTAG_SETUP, and JTAG_DATA registers).

Read/Write Access

Reads and writes to the Tile debug state are managed by sending commands to the Tile via the following JTAG instruction registers:

Table 2-22. JTAG Instruction Registers

Name        Number   Size (bits)   Description
EXT_MODE    0xF8     2             Must be written with 0x1 to enable Tile debug.
INST_SEL    0xF6     TWIDTH*8      Debug command register, one in each Tile; the registers of a complete row of Tiles are concatenated. The format of this register is defined below. TWIDTH=69 TWIDTH=70
BLOCK_SEL   0xF4     17            One-hot Tile row select. Bits[16:8] are reserved and must be 0.

The format of the INST_SEL command shifted into each Tile is as follows:

Bits            Field     Description
(TWIDTH-1):55   R         Must be 0.
54:53           CMD       0: NOP, 1: Read, 2: Write.
52:48           OBJ_SEL   Object within the tile being accessed.
47:32           OFFSET    Offset within the structure being accessed.
31:0            DATA      Shifted in for writes, shifted out for reads.

This command is shifted into the West-most Tile in the row selected by BLOCK_SEL, then through the rest of the Tiles in the row, and finally out of the East-most Tile and back to the JTAG controller. Each Tile has its own 69-bit command register, so 69*8 bits must be shifted in, with the CMD fields of the non-accessed Tiles set to 0.

A write to the debug port on a given Tile would be handled as follows:

1. Write to the EXT_MODE register to set it to 1.
2. Set the BLOCK_SEL register to indicate which Tile row is to be accessed (for example, a Tile in the 3rd row down would use BLOCK_SEL = 0x0_0008).
3. Shift 69*8 bits into the INST_SEL register. The set of 69 bits corresponding to the Tile to be written would have CMD=2 and the OBJ_SEL/OFFSET/DATA fields set appropriately. All other Tile commands would be 0.

Reads require two steps. The read command is shifted in exactly as a write, except with CMD=1. Then a second shift of 69*8 bits is used to extract the data.

Objects

The objects listed in Table 2-23 are defined to provide Tile debug state.

Note: The diagnostic access path is 16 bits wide.
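The INST_SEL command layout and the write procedure described above can be sketched as host-side helper functions. This is an illustration under stated assumptions, not a JTAG driver: the field positions (CMD in bits 54:53, OBJ_SEL in 52:48, OFFSET in 47:32, DATA in 31:0) are a reading of the command format above, the row index passed to block_sel_for_row is assumed to be numbered from 0 so that row 3 yields the 0x0_0008 value in the example, and all function and constant names are hypothetical. The serial bit order of the shift itself is not modeled.

```c
#include <assert.h>
#include <stdint.h>

enum { DBG_CMD_NOP = 0, DBG_CMD_READ = 1, DBG_CMD_WRITE = 2 };

#define TILES_PER_ROW 8 /* the text shifts 69*8 bits per row */

/* Pack the low 64 bits of one Tile's command register. The R field
 * (bits 55 and above, including bits beyond 63) must be 0, so the
 * remaining upper bits of the 69-bit register are all zero. */
static uint64_t inst_sel_pack(unsigned cmd, unsigned obj_sel,
                              unsigned offset, uint32_t data)
{
    assert(cmd <= DBG_CMD_WRITE);
    assert(obj_sel < 32);     /* 5-bit field, assumed bits 52:48 */
    assert(offset < 65536);   /* 16-bit field, assumed bits 47:32 */
    return ((uint64_t)cmd << 53) | ((uint64_t)obj_sel << 48) |
           ((uint64_t)offset << 32) | (uint64_t)data;
}

/* One-hot row select; bits [16:8] are reserved, so only rows 0..7 are valid. */
static uint32_t block_sel_for_row(unsigned row)
{
    assert(row < 8);
    return UINT32_C(1) << row;
}

/* Build the per-Tile commands for one row: only the target Tile gets a
 * real command, every other Tile gets an all-zero (NOP) command. */
static void inst_sel_build_row(uint64_t row_cmds[TILES_PER_ROW],
                               unsigned target_x, uint64_t cmd_word)
{
    assert(target_x < TILES_PER_ROW);
    for (unsigned i = 0; i < TILES_PER_ROW; i++)
        row_cmds[i] = (i == target_x) ? cmd_word : 0;
}
```

A read would reuse inst_sel_pack with DBG_CMD_READ and then require a second shift of the row to collect the DATA field, as described above.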
The LSBs of the JTAG address are used to select a 16-bit slice of wider structures; for example, for a 64-bit structure, JTAG[1:0] would be used to access bits [15:0] when 0, [31:16] when 1, etc.

Table 2-23. Tile Debug Objects

Each entry gives the object name, its OBJ_SEL value (in hex), and its data width, followed by the description.

L1D data (OBJ_SEL 0x09, data width 144)
1. JTAG address [14] selects the way1/way0 [1: way1, 0: way0].
2. JTAG address [13:12] selects a ram, a 16-byte chunk of a cacheline.
3. JTAG address [11:4] is [L1_IDX_MSB:L1_IDX_LSB].
4. Each index in a ram has 144 bits, for a total of 144x4=576 bits including 512-bit data and 64-bit parity.
5. ram0 has byte-15 to byte-0, with layout [byte-15 parity, byte-15 data, byte-7 parity, byte-7 data, byte-14 parity, byte-14 data, byte-6 parity, byte-6 data….byte-0 data].
6. ram1 has byte-31 to byte-16, with layout [byte-31 parity, byte-31 data, byte-23 parity, byte-23 data, byte-30 parity, byte-30 data, byte-22 parity, byte-22 data….byte-16 data].
7. ram2 has byte-47 to byte-32.
8. ram3 has byte-63 to byte-48.
For more information refer to Section 2.4.4 Cache Micro Architecture in the Tile Processor Architecture Overview for the TILE-Gx Series (UG130).

L1D tag (OBJ_SEL 0x0a, data width 54)
1. JTAG address [9:2] is [L1_IDX_MSB:L1_IDX_LSB].
2. JTAG address [1:0] is indexing 54-bit input/output.
3. Each index in the ram has 54 bits, including 2 ways with 26 bits tag and 1 bit parity each way.
4. The physical layout of the 54 bits is [way1 parity, way1 tag, way0 parity, way0 tag].
For more information refer to Section 2.4.4 Cache Micro Architecture in the Tile Processor Architecture Overview for the TILE-Gx Series (UG130).

L2 data (OBJ_SEL 0x02, data width 266)
1. JTAG address [16:15] selects the ram [3: way7/6, 2: way5/4, 1: way3/2, 0: way1/0].
2. JTAG address [14] selects the odd/even way [1: odd, 0: even].
3. JTAG address [13:5] is [L2_IDX_MSB:L2_IDX_LSB].
4. The object includes the bottom half of the L2 data in 4 rams.
5. Each index in a ram has 266 bits, with [265:256] the ecc code, and [255:0] the bottom half of a cacheline data.
For more information refer to Section 2.4.4.2 L2 Cache Subsystem in the Tile Processor Architecture Overview for the TILE-Gx Series (UG130).

L2 data (OBJ_SEL 0x01, data width 266)
1. JTAG address [16:15] selects the ram [3: way7/6, 2: way5/4, 1: way3/2, 0: way1/0].
2. JTAG address [14] selects the odd/even way [1: odd, 0: even].
3. JTAG address [13:5] is [L2_IDX_MSB:L2_IDX_LSB].
4. The object includes the upper half of the L2 data in 4 rams.
5. Each index in a ram has 266 bits, with [265:256] the ecc code, and [255:0] the upper half of a cacheline data.
For more information refer to Section 1.6.2 RegisterFile (RF) in the Instruction Set Architecture for TILE-Gx (UG401).

L2 tag (OBJ_SEL 0x03, data width 272)
1. JTAG address [13:5] is [L2_IDX_MSB:L2_IDX_LSB].
2. The output of the two rams totals 272 bits, including 1-bit parity, 8-bit id, and 25-bit tag for each way, 8 ways total.
3. 272 bits layout: 8-way interleaving [parity, id, tag].
4. Parity does not cover the 8-bit id.
For more information refer to Section 2.4.4.2 L2 Cache Subsystem in the Tile Processor Architecture Overview for the TILE-Gx Series (UG130).

L2 lva (OBJ_SEL 0x05, data width 64)
L2 State.
1. JTAG address [10:2] is [L2_IDX_MSB:L2_IDX_LSB].
2. L2_LVA is a two-port ram, with 1 read port and 1 write port.
3. The output of the ram is 64 bits.
4. The 64-bit layout is: 8-bit LRU at [63:56], followed by 8-way interleaving [parity, 1-bit touch, 3-bit share, 1-bit dirty, 1-bit valid].

L2 directory (OBJ_SEL 0x07, data width 72)
L2 Share Tracker.
1. JTAG address [13:12] selects the way-pair [7/6, 5/4, 3/2, 1/0].
2. JTAG address [11:3] is [L2_IDX_MSB:L2_IDX_LSB].
3. The output of the two rams totals 72 bits [odd way 36 bits, even way 36 bits].

L2 SDN (OBJ_SEL 0x04, data width 256)
SDN Ingress Buffer.
1. JTAG address [8:4] is indexing both 32-entry rams.
2. The output of the two rams totals 256 bits.
L2 RTF 0x06 156 L2 Retry FIFO. 1. JTAG address [8:4] is indexing a 32-entry ram. 2. Output of the two rams are total 156 bits. Tile Processor and I/O Device Guide for the TILE-Gx Family of Processors Tilera Confidential — Subject to Change Without Notice Performance Counters / System Diagnostics Table 2-23. Tile Debug Objects (continued) Name OBJ_SEL (in HEX) Data Width Description L2 MAFa 0x08 91 L2 Missing Address File. 1. JTAG address [5:3] is indexing a 8-entry MAF. 2. JTAG address [2:0] is indexing a 7-slice 15-bit data [14:0]. 3. The top bit of each slice [15] is the “not-success” signal. DMUX 0x0f 74 UDN and IDN Ingress Buffer. 1. JTAG address [10:3] is indexing both 256-entry rams. 2. Output of the two rams are total 74 bits. For more information refer to “Demultiplexing” on page 46 REGFILEa 0x11 64 1. JTAG address [7:2] is indexing 64-entry REGFILE. For more information refer to Section 1.6.2 RegisterFile (RF) in the Instruction Set Architecture for TILE-Gx (UG401). QUIECE 0x12 1. set_quiesce by writing 1, clear_quiesce by writing 0, [1] = cbox, [0] = sbox. For more information refer to “Quiesce” on page 56. SPR 0x10 64 1. JTAG address [1:0]: selecting chunk of the holding register. 2. JTAG address [15:2]: SPR index (when access-bit = 1). 3. JTAG address [16]: access-bit. = 1 to write/read the SPR; = 0 to write/read the holding register. For more information refer to Section 1.5 Special Purpose Registers in the Instruction Set Architecture for TILE-Gx (UG401) and Section 2.3.1 Special Purpose Registers (SPRs) in the Tile Processor Architecture Overview for the TILE-Gx Series (UG130). L1I datab 0x0b 0x0c 65 65 1. JTAG address[13] is the MSB of the ram address, selects the line from way 0 or 1 within the addressed set. 2. jtag_address[12:5] is the LSBs of the ram address, indexing the set (1 of 256). 3. JTAG address[4:3] selects the instruction 0 - 7 within the line. 4. 
JTAG address [2:0] selects the 16-bit slice from the instruction: 0 -> [15:0], 1 -> [31:16], etc.; 4 -> {15'b0, odd parity}; 5, 6, and 7 are unused.
For more information, refer to Section 2.4.2 Front End Micro Architecture in the Tile Processor Architecture Overview for the TILE-Gx Series (UG130).

L1I tag (b)    0x0e, 0x0d    74, 74
1. JTAG address [9:3] is the set.
2. JTAG address [2:0] selects the 16-bit slice from the tags. [73:37] is way 1 and [36:0] is way 0. Within each 37 bits: [36:28] is the way predict, [27] is valid, [26] is odd parity, and [25:0] is physical address [39:14].
For more information, refer to Section 2.4.2 Front End Micro Architecture in the Tile Processor Architecture Overview for the TILE-Gx Series (UG130).

a. Read only access.
b. The Icache is physically and logically split as EVEN and ODD. For instructions within a given cacheline, instructions 0, 2, 4, 6 are in the EVEN DAT object and instructions 1, 3, 5, 7 are in the ODD DAT object (that is, bit 3 of the instruction address determines which DAT object to use). The same applies to tags; however, bit 6 of the instruction address determines which TAG object to use.

When accessing tile objects via JTAG, it is important that the tile is quiesced. This prevents resource conflicts. Also, when extracting data from the trace buffer, software should first clear the trace-buffer-enable via a JTAG write to the tile’s trace buffer control SPR. This prevents a trace buffer write from being corrupted by a JTAG access to the trace buffer.

2.10.7.2 Quiesce

In order to read Tile debug state via the debug port, the Tile must first be quiesced. This process can be handled via the broadcast networks described above or by writing the QUIESCE bit via the debug port.
If the latter technique is used, software must first assign the DIAG_BCST_CTL.QUIESCE_SEL SPR to at least one of the broadcast networks to enable the Tile’s quiesce capability. This is typically done by boot software to allow debugging at any later time.

A Tile that has been quiesced will cease instruction fetch operations. Dynamic network traffic will continue to pass through the Tile, and multi-Tile cache coherency operations will continue to be processed. Other Tiles can therefore continue to operate and communicate normally.

2.11 Boot Processes and Data Format

2.11.1 Boot Flow

The Tile Processor™ is booted by pushing boot data to the Tiles over the UDN using the Rshim’s packet generator. Thus any device that can access the Rshim can boot the chip. A level-0 boot program is built into the hardware and runs immediately following hardware reset. This program interprets the incoming boot stream according to the L0 boot format (Table 2-24).

Boot-capable interfaces such as USB, StreamIO, or PCIe provide a hardware means for communicating with the Rshim for booting. Since all bootable devices communicate with the Rshim for booting, the flow is essentially the same regardless of the source device.

Boot data can also be “pulled” into the chip via the SPI or I2C-Master interfaces. This is controlled by the RSHIM_BOOT_CONTROL register. This register is initialized based on the BOOT_MODE strapping pins but can be overridden by software. It is NOT reset when the chip is software-reset, so it is possible to change the boot mode and then reset the chip. This is useful for operations like POST that might require different boot images and/or sources depending on POST results.

For more information about the BOOT_MODE strapping pins, SROM_SPI_SCK and I2CMx_SCL, refer to the appropriate TILE-Gx data sheet for your processor.

Table 2-24.
L0 Boot Format
Field                      Description
WORD-0                     Number of words in the block, not counting the first two and last words.
WORD-1                     64-bit address where the block should be stored.
WORD-2 through WORD-N+1    'N' 64-bit words of L1 boot code.
WORD-N+2                   Address to which the L0 boot code should jump. If 0, another block is read; otherwise, this is the address at which execution of the L1 code begins.

2.11.2 Chip Modes and Reset Behavior

There are two special processor modes that are especially relevant to the booter: physical memory mode and cache-as-RAM mode.

Physical memory mode allows the processor to do memory references without first needing to program virtual-to-physical address translations into the TLB. In this mode, the 40-bit physical address (PA) used for each load, store, or instruction fetch is constructed from the low 40 bits of the 42-bit virtual address (VA) used by the processor. The processor is in physical memory mode when it emerges from reset, and all of the boot code executes in this mode; the TLB is not enabled until just before the hypervisor begins execution.

Cache-as-RAM mode allows the processor to execute code and perform load and store instructions before the memory controllers have been configured, without using any external memory. This is accomplished by a very small change to the behavior of the level two cache (the L2$). While in cache-as-RAM mode:
• A read to an address that is valid within the cache (a read hit) is handled normally: the data is returned to the processor.
• A read to an address that is not valid within the cache (a read miss) is also handled normally: a request is sent to a memory controller, asking for the data. Typically, when running in cache-as-RAM mode, the memory controllers (and the tile SPRs that select which controller serves which memory region) are not configured, so such a request will never complete, causing the chip to hang. Thus, the booter never makes a memory reference that could cause a read miss.
• A write to an address that is valid within the cache (a write hit) is handled normally: the data in the cache is modified.
• A write to an address that is not valid within the cache (a write miss) is handled specially. Normally, such a request would result in an eviction (where the data currently in the relevant cache line, if modified, would be written to memory); a background fill (where the data in the target memory cache line would be read into the cache); and finally, a modification of the data in the cache. In cache-as-RAM mode, the first two steps are skipped. Instead, a target cacheline is picked (a field in an SPR determines which way in the cache is used for this), the data is written to that line, and its tags are set to match the newly written address. Note that this modification does not set the dirty bit on the cache line; in fact, any write to a cache line while in cache-as-RAM mode clears its dirty bit. This prevents the line (which might have a completely false physical address in its tags) from ever being flushed to memory on eviction.

Like physical memory mode, the processor is in cache-as-RAM mode when it emerges from reset, and all of the boot code executes in this mode. Note that the use of an SPR to designate the way targeted on a write miss means that each byte of the cache must be written to before it can be used as memory.

About the Software Stack

For information about the software stack, refer to Tilera Hypervisor Theory of Operation, which is available in the $TILERA_ROOT/doc/html directory.

2.11.3 Boot FIFO

For a description of how the host interface is used during the boot sequence, refer to “Rshim Host Interface” on page 234. For a description of how software implements flow control for boot transactions, refer to “Boot and Rshim Regions” on page 81.
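
As a recap of the L0 boot format in Table 2-24, the following Python sketch walks a boot stream block by block the way the table describes. It is illustrative only: the function name, the little-endian word order, and the error handling are assumptions made for the example and are not taken from the level-0 boot program itself.

```python
import struct


def parse_l0_boot_stream(stream):
    """Split an L0 boot stream into blocks, following Table 2-24.

    Returns (blocks, entry_address), where blocks is a list of
    (store_address, payload_words) tuples.
    Assumption: words are 64-bit little-endian; the table does not say.
    """
    words = [w[0] for w in struct.iter_unpack("<Q", stream)]
    blocks = []
    pos = 0
    while True:
        if pos + 2 > len(words):
            raise ValueError("stream ended without a non-zero jump address")
        count = words[pos]        # WORD-0: N, the number of payload words
        address = words[pos + 1]  # WORD-1: where the block should be stored
        end = pos + 2 + count     # index of WORD-N+2
        if end >= len(words):
            raise ValueError("truncated boot block")
        payload = words[pos + 2:end]  # WORD-2 through WORD-N+1
        jump = words[end]             # WORD-N+2: jump address
        blocks.append((address, payload))
        if jump != 0:
            return blocks, jump       # non-zero: begin executing L1 code here
        pos = end + 1                 # zero: another block follows
```

A stream with two blocks, the first terminated by a zero jump word and the second by the entry address, yields both blocks plus that entry address.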

CHAPTER 3 DOUBLE DATA RATE SDRAM (DDR3) INTERFACE

3.1 Overview

The TILE-Gx36™ supports two independent DDR3 memory channels. There is one memory controller for each memory channel. Each memory controller can be operated independently. The memory controller supports up to 800 MHz memory clock and 1600 MT/s data rate (PC3-12800). The memory controller supports 64 bits of data plus an optional 8 bits of ECC. The memory controller supports up to 16 ranks, and supports x4, x8, and x16 devices.

Any tile can communicate with any memory controller through the on-chip mesh network. The reQuest Dynamic Network (QDN) is used for handling memory and MMIO requests. The Response Dynamic Network (RDN) is used for memory responses, MMIO responses, as well as interrupts.

The memory controller uses a CAM-based scheduler to improve memory bandwidth and latency. A hardware memory striping mode is supported to distribute memory loads across multiple memory controllers. DRAM address mapping is configurable. The priority level for memory requests is also configurable.

The memory PHY (MPHY) layer handles all the physical aspects of operations, such as DRAM interface timing and MPHY interface bring-up.

[Figure 3-1: Memory Controller Block Diagram. Blocks shown: mesh network interface with QDN0/QDN1 and RDN0/RDN1 retime FIFOs, ingress path (ingress control, read/write response FIFO), egress path, CMD buffers, DATA buffer, scheduler, protocol controller, ECC, MMIO, and memory PHY, spanning the core clock domain and the DRAM clock domain.]

3.2 Interfaces

The memory controller is attached to the external DRAM through the memory PHY interface.
The memory controller is attached to the TILEs through the network interface.

3.2.1 DDR3 Interface

Major characteristics of external memory are shown in Table 3-1.

Table 3-1. DDR3 Interface Characteristics
Feature                   Description
Memory clock frequency    Up to 800 MHz
Bit rate                  Up to 1600 MT/s
Data width                64 bits, plus optional 8 bits ECC
Rank supported            Up to 16 ranks
Bank supported            Up to 128 banks, optimized for 32 banks
ECC support               Single bit correction, double bit detection
DRAM parameters           DDR3 parameters are fully programmable
Voltage                   1.5V or 1.25V

3.2.2 Network Interface

The memory controller has two QDN network connections (QDN0 and QDN1) and two RDN network connections (RDN0 and RDN1). The networks run at the core clock frequency. QDN0 and RDN0 are physically connected to one TILE, and QDN1 and RDN1 are physically connected to another TILE.

Any tile can send a memory request to the memory controller through one of the two QDN network connections. The selection of the QDN network is static and is controlled by a configuration register in the TILE. The memory controller always returns a response/ACK on the same port on which the request arrived; that is, a request that comes from the QDN0 network is answered on the RDN0 network, and a request that comes from the QDN1 network is answered on the RDN1 network.

The memory controller supports memory mapped I/O (MMIO). A TILE sends MMIO packets to access the configuration registers in the memory controllers for configuration, status, and so on. MMIO requests must come from the QDN0 network. Any MMIO requests coming from the QDN1 network are considered errors, and the error status will be logged.

After the memory requests are received from the network, they are stored in the command buffer (CMD buffer), and the write data associated with write requests is stored in the data buffer (DATA buffer).
The memory controller performs the link level flow control over the mesh network connections.

3.3 Data Flows

Terminology: egress refers to the direction from a TILE to the external memory; ingress refers to the direction from the external memory to a TILE.

3.3.1 QDN Memory Read Request Flow

A QDN memory read request sent by a tile arrives at a QDN port of the memory controller. A retime FIFO is used to handle the clock domain crossing between the core clock domain and the memory clock domain. The DRAM clock frequency is totally decoupled from the core clock domain. An arbiter is used to handle the arbitration between the two QDN ports. Once a QDN memory read request is selected by the arbiter, it is queued in the CMD Buffer.

To optimize for memory bandwidth and latency, the memory controller uses a scheduler to service the memory requests based on external memory status, memory request attributes, and scheduling parameters. The QDN memory read request is converted to external memory commands (such as activate, precharge, read) before it is sent to the memory PHY.

3.3.2 RDN Memory Read Response Flow

The memory controller constructs the header portion of an RDN read response from the read response FIFO. The memory controller assembles the data portion of an RDN read response packet from the read data returned from the external memory. After the retime FIFO, the RDN read response packet is converted back into the core clock domain. To improve the mesh network utilization, the memory controller can be configured so that no bubble cycle is inserted (and wasted) in the middle of an RDN read response packet.

Each QDN memory read request has a SEND_COPY attribute. Normally, the RDN memory read response packet is sent to the home tile through the network connection.
If the SEND_COPY attribute is asserted, then the same read response packet will be sent to the original requesting tile through the other network connection. The QDN memory read request packet contains information on the location of the original requesting tile.

3.3.3 QDN Memory Write Request Flow

The QDN memory write request flow is similar to the QDN memory read request flow. The difference is that write request headers are stored in the CMD Buffer, and write request data are stored in the DATA Buffer. The write request header and the write request data differ in format and size. The memory controller dispatches the DDR3 write command and the DDR3 write data at different times.

3.3.4 RDN Memory Write Response Flow

The RDN memory write response flow is similar to the memory read response flow. The difference is that the write response can be sent to an RDN port without waiting for the memory operation to be finished in the external memory. This is referred to as an early write ack.

Each memory write request has a NO_ACK attribute bit. When the NO_ACK attribute is asserted, no write response will be dispatched. This is referred to as a no write ack.

3.3.5 Non-Cacheline Write Flow and Masked Write Flow

DDR3 SDRAM is burst-oriented, with the burst length being programmed to eight by the memory controller. (The burst length can be specified as either four or eight, but the memory controller only programs a burst length of eight.) A burst of eight transfers 64 bytes of data over the 64-bit data bus on the physical interface. The cacheline size on the TILE-Gx processors is also 64 bytes. One cacheline access maps to one burst-of-eight DDR3 access.

If a QDN memory write request is not a full cacheline write, then memory background data will be fetched first so that the integrity of the background can be maintained and the optional ECC can be calculated. The memory controller writes back to the external memory after merging the
background data (from the fetch) with the QDN memory write data (from the Data Buffer). In order to improve memory bus utilization, other memory requests can be scheduled between the background fetch and the merged data write-back.

A masked write is a QDN memory write request where some of the bytes are masked (that is, not to be written). The memory controller treats masked writes the same as it treats non-cacheline writes.

3.4 Ordering

3.4.1 Out of Order Dispatch

Between a given source (that is, a tile or an I/O) and a destination (that is, a memory controller) pair, QDN memory requests are always routed in order across the mesh network. However, the memory controller can dispatch memory requests out of order to improve bandwidth and latency.

The only exception to this rule is when a physical address conflict is detected between any two memory requests. An address conflict occurs when two memory requests overlap (partially or entirely), regardless of whether the requests come from the same tile or I/O. These two memory requests will be dispatched in the order in which they are received by the memory controller, although not necessarily back to back: other non-conflicting requests can be dispatched in between.

3.4.2 Out of Order Response

Between a source (that is, a memory controller) and a destination (that is, a tile or an I/O), RDN responses are always routed in order. However, the memory controller might choose to return the RDN responses in a different order from that in which the corresponding QDN memory requests were received.

The destination (that is, a tile or an I/O) uses a unique tag inside the RDN response packets to differentiate outstanding responses. The memory controller copies the tag to the RDN response packet from the tag in the QDN memory request.
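
The address-conflict rule in Section 3.4.1 can be illustrated with a short Python sketch. It is not a model of the controller hardware: the function names are hypothetical, and requests are assumed to be described by a byte address and a byte length, which is a simplification chosen for the example.

```python
def overlaps(addr_a, size_a, addr_b, size_b):
    """True if byte ranges [addr, addr + size) share at least one byte."""
    return addr_a < addr_b + size_b and addr_b < addr_a + size_a


def is_legal_dispatch_order(requests, order):
    """Check a dispatch order against the address-conflict rule.

    requests: list of (address, size) tuples in arrival order.
    order: list of request indices in the order they are dispatched.
    Conflicting requests must keep their arrival order; all other
    requests may be reordered freely.
    """
    slot = {request: position for position, request in enumerate(order)}
    for later in range(len(requests)):
        for earlier in range(later):
            conflict = overlaps(*requests[earlier], *requests[later])
            if conflict and slot[earlier] > slot[later]:
                return False
    return True
```

For example, with arrivals `[(0x1000, 64), (0x2000, 64), (0x1020, 8)]`, the third request overlaps the first, so it may be dispatched after the second request or before it, but never ahead of the first.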

3.5 Addressing

A QDN request packet provides a 40-bit physical address. The upper bit(s) are the controller selection bits, which are used to select the memory controller. For example, in the TILE-Gx36 implementation, bit 39 of the address bus is used to select the memory controller. The lower address bits, bit 38 to bit 0, are used by each memory controller; that is, up to 512 GB of address space can be supported by each memory controller, as shown in Figure 3-2.

[Figure 3-2: PA Address Mapping. The TILE PA bits [39:0] are split into the controller selection bit(s) at the top and the memory controller PA bits below.]

3.5.1 Memory Controller Striping

Each memory controller has visibility to the lower address bits (bit 38 to bit 0). These physical address (PA) bits are identical to the PA bits sent from TILEs. The controller selection bit(s) can be configured as a hash function of other PA bits so that the memory requests are striped across multiple memory controllers.

PA_hashed[39] = PA[39] ^ (((PA[25]&E[2]) ^ PA[18] ^ PA[15] ^ (PA[14]&E[1]) ^ (PA[9]&E[0])) & M[1])
PA_hashed[38] = PA[38] ^ (((PA[24]&E[2]) ^ PA[17] ^ PA[16] ^ (PA[13]&E[1]) ^ (PA[10]&E[0])) & M[0])

Where E[2:0] are the configuration bits that enable the hash function terms. E[2:0] are defined by the MEM_STRIPE_CONFIG[10:8] register. M[1:0] defines the striping mode. M[1:0] comes from the two bits of MEM_STRIPE_CONFIG[7:0] selected by indexing with PA[39:38].

Table 3-2. Striping Mode
M[1:0]    Description
00        No load balancing
01        Load balancing between memory controller pair (0,1) and/or pair (2,3)
10        Load balancing between memory controller pair (0,2) and/or pair (1,3) (a)
11        Load balancing between all controllers (0,1,2,3) (a)
a. Applies to a TILE-Gx processor with four memory controllers.

3.5.2 DDR Address Mapping (from Memory Address Mapping)

The physical address bits are mapped to the external memory address, as shown in Figure 3-3.
The LSB field has three bits, because the physical interface has 64 bits (eight bytes). The width of the column field and the width of the row field depend on the external DRAM components. The bank field has three bits, because each rank comprises eight banks. The rank field has four bits, because each memory controller supports up to 16 external DRAM ranks.

[Figure 3-3: DDR Address Mapping. PA bits 38 through 0 map, from most significant to least significant, onto the rank, row, bank, column, and LSB fields of the DDR address.]

3.5.3 Memory Rank/Bank Hashing

The row field and the column field are identical to the corresponding PA bits. The rank and the bank fields can be configured as a hash function of other PA bits, as the following examples show.
• Rank[1] = PA_rank[1] ^ ((PA[11] ^ PA[12] ^ PA[25]) & CFG_ADDR_HASH[4])
• Rank[0] = PA_rank[0] ^ ((PA[10] ^ PA[16] ^ PA[24]) & CFG_ADDR_HASH[3])
• Bank[2] = PA_bank[2] ^ ((PA[9] ^ PA[6] ^ PA[22] ^ PA[23]) & CFG_ADDR_HASH[2])
• Bank[1] = PA_bank[1] ^ ((PA[8] ^ PA[18] ^ PA[21] ^ PA[26]) & CFG_ADDR_HASH[1])
• Bank[0] = PA_bank[0] ^ ((PA[7] ^ PA[19] ^ PA[20] ^ PA[27]) & CFG_ADDR_HASH[0])
Where the bit locations of PA_rank and PA_bank are determined by the widths of the row field and the column field.

3.5.4 Logical Rank and Physical Rank Mapping

The memory controller maps the 4-bit logical rank to the 16 external physical rank selection bits. The physical rank selection bits are connected to the DRAM sockets (or components) on the board. Depending on the type of DRAM modules populated in the DRAM sockets (for example, single-rank, dual-rank, or quad-rank), the mapping function can be configured accordingly through the MSH_DDR3_DIMM_CFG register.
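
The striping equations in Section 3.5.1 and the rank/bank hash examples in Section 3.5.3 are plain XOR and AND networks, so they can be checked with a few lines of Python. The sketch below transcribes those equations directly; the function names are hypothetical, the register fields are passed in as ordinary integers, and passing Rank[3:2] through unhashed is an assumption, since the examples only cover Rank[1:0].

```python
def bit(value, index):
    """Return bit `index` of `value` as 0 or 1."""
    return (value >> index) & 1


def xor_bits(value, indices):
    """XOR together the selected bits of `value`."""
    result = 0
    for index in indices:
        result ^= bit(value, index)
    return result


def stripe_select(pa, e, m):
    """Return PA_hashed[39:38] (Section 3.5.1); e is E[2:0], m is M[1:0]."""
    h39 = bit(pa, 39) ^ ((
        (bit(pa, 25) & bit(e, 2)) ^ bit(pa, 18) ^ bit(pa, 15)
        ^ (bit(pa, 14) & bit(e, 1)) ^ (bit(pa, 9) & bit(e, 0))
    ) & bit(m, 1))
    h38 = bit(pa, 38) ^ ((
        (bit(pa, 24) & bit(e, 2)) ^ bit(pa, 17) ^ bit(pa, 16)
        ^ (bit(pa, 13) & bit(e, 1)) ^ (bit(pa, 10) & bit(e, 0))
    ) & bit(m, 0))
    return (h39 << 1) | h38


# PA bits XORed into each output bit, copied from the Section 3.5.3 examples.
BANK_TERMS = {2: (9, 6, 22, 23), 1: (8, 18, 21, 26), 0: (7, 19, 20, 27)}
RANK_TERMS = {1: (11, 12, 25), 0: (10, 16, 24)}


def hash_bank(pa, pa_bank, cfg_addr_hash):
    """Return Bank[2:0]; enables come from CFG_ADDR_HASH[2:0]."""
    bank = 0
    for i, terms in BANK_TERMS.items():
        enable = bit(cfg_addr_hash, i)
        bank |= (bit(pa_bank, i) ^ (xor_bits(pa, terms) & enable)) << i
    return bank


def hash_rank(pa, pa_rank, cfg_addr_hash):
    """Return the rank; enables come from CFG_ADDR_HASH[4:3]."""
    rank = pa_rank & ~0b11  # assumption: Rank[3:2] pass through unhashed
    for i, terms in RANK_TERMS.items():
        enable = bit(cfg_addr_hash, i + 3)
        rank |= (bit(pa_rank, i) ^ (xor_bits(pa, terms) & enable)) << i
    return rank
```

With every enable bit cleared, each function returns its input field unchanged, which matches the "No load balancing" row of Table 3-2 for the striping case.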

[Figure 3-4: Rank Mapping. The logical rank bits [3:0] map onto the physical rank selection bits [15:0].]

3.6 Scheduler

3.6.1 Memory Page Management Policy

The memory controller supports an open-page policy and a close-page policy. The page management policy can be configured by using the MSH_CONTROL register.

When the close-page policy is used, a DRAM page will be closed after its memory reference is done. In general, the close-page policy provides more deterministic memory latency, but it also consumes more power on external memories. When the memory access patterns are random (that is, there is minimal spatial and temporal locality) and the memory request rate is light, the close-page policy could provide lower memory latency in some applications.

When the open-page policy is used, a DRAM page stays open once it is opened (with the hope that the same page will be accessed in the near future). When the open-page policy is enabled, the memory controller must decide whether the auto-precharge command should be applied at the time a memory read command or a memory write command is dispatched. The decision is based on scheduled memory requests that are to be dispatched in the near future. As such, the open-page policy could result in page management decisions similar to those of the close-page policy when the memory access patterns are random and the memory request rate is heavy.

3.6.2 Memory Request Reordering

The memory controller uses a 32-entry CAM to assist the scheduling decisions. The decisions are based on several considerations.
• Ordering must be enforced if memory requests reference the same address.
• The Read-first policy is used. The memory controller has separate read and write queues. Reads have higher priority than writes, that is, read > write.
• The Hit-first policy is used.
The memory controller checks whether the QDN memory request references an open DRAM page or a closed DRAM page. For example, a read request to an open page (that is, a read page hit) has a higher priority than a request to a closed page (that is, a read page miss). This is noted as read page hit > read page miss. Similarly, write page hit > write page miss.
• The Priority-first policy is used. A priority level can be assigned to QDN memory requests. For example, memory requests from a mission-critical task could be assigned a high priority. Memory requests with a high priority tend to be serviced first; as such, their response latency tends to be lower.
• TILE source awareness. The memory controller takes the source of the QDN memory requests into account for memory request scheduling.
• Anti-starvation controls are in place. If a request has not been served for a certain amount of time, the request times out and is given a higher priority. Therefore, timed out read page hit > timed out read page miss > timed out write page hit > timed out write page miss > read page hit > read page miss > write page hit > write page miss. The starvation threshold is software-programmable. Note that timed out writes have a higher priority than reads. If the starvation values are programmed such that writes are more likely to time out than reads, then writes appear to have a higher priority. Writes tend to be serviced later than reads, and many writes can sit in the CAM. Another threshold can be configured so that writes are treated as starved if there are too many write requests pending in the CAM.
• Memory requests are stored in virtual queues and sorted based on memory rank and bank. Load balancing is applied to reduce the overhead associated with precharge/activate.
• Memory requests come from one of two QDN networks. Load balancing is applied to improve fairness between the two networks and reduce potential network congestion.
• To reduce the DRAM overhead due to turnaround between read and write, it is desirable to stay in the current read (or write) queue for some amount of time, which is programmable by software.
• To reduce the DRAM overhead due to turnaround between ranks, it is desirable to stay in the same rank for some amount of time.

3.6.3 Memory Command Reordering

Each QDN memory read/write request might result in multiple external memory commands (that is, precharge, activate, column read/write). In order to reduce the potential overhead associated with precharge and activate, the memory controller may reorder the precharge/activate relative to its associated read or write request. For example, a sequence of pchg1, act1, read1, pchg2, act2, read2 can be converted to pchg1, pchg2, act1, act2, read1, read2.

3.7 DIMM Support

The memory controller supports many kinds of memory modules, which are described in the following sections.

3.7.1 Serial Presence-Detect EEPROM Support

The SPD EEPROM can be accessed through one of the on-chip I2C master interfaces.

3.7.2 Temperature Sensor

Certain DIMM modules include an on-board I2C temperature sensor with an integrated serial presence-detect (SPD) EEPROM. System designers can program the registers on the SPD EEPROM to customize the temperature-sensing configuration. When critical temperature thresholds have been exceeded, the temperature sensor asserts the event to the memory controller. The edge-triggered event from “not-exceeding” to “exceeding” the temperature threshold asserts an interrupt bit in the memory controller.

3.7.3 Address/Command Parity

Certain DIMM modules support parity detection on the address/command bus. The memory controller will generate an even parity bit for the address and command (A/BA/RAS_N/CAS_N/WE_N).
When a parity error is detected, the DIMM module will assert the parity error to the memory controller. The edge-triggered event from no-error to error asserts an interrupt bit in the memory controller. Two parity error interrupts are implemented to differentiate which DIMM modules have detected the parity error.

3.7.4 RDIMM Control Word Access

RDIMM control words provide configuration of certain device features on the DIMMs. The control words are accessed by the simultaneous assertion of the first two DDRx_CS_N signals on a DIMM. Refer to the MSH_DDR3_USER_INIT_1 register for more details on control word programming.

3.7.5 Memory PHY Training

The memory controller has various initialization state machines to assist the external DRAM initialization sequence and to bring up the memory PHY interface. Refer to the memory PHY registers for more details.

CHAPTER 4 PCIE CONTROLLER ARCHITECTURE (TRIO)

4.1 Overview

The TILE-Gx™ processor’s PCIe Controller provides services to integrate the Tile processors with a PCI system. Both endpoint and root complex modes are supported. Several data movement and communication models are supported simultaneously. These can be summarized as:
• Tile Programmed I/O (PIO): Tile software communicates directly with the PCI system using Memory Mapped I/O (MMIO) loads and stores. PIO can be used for configuration in root complex mode.
• Host PIO: Host software or a connected PCI device communicates with Tile physical memory space using reads/writes. The host or PCI device can use DMA transfers to move data to/from Tile physical memory space.
• Push DMA: Bulk data transfer from Tile physical memory to PCI address space; typically used in endpoint mode.
• Pull DMA: Bulk data transfer from PCI address space to Tile physical memory; typically used in endpoint mode.
• Ingress Scatter: Writes from the PCI system consume buffers enqueued at the PCIe controller.

The various controller interfaces are shown in Figure 4-1.

[Figure 4-1: PCIe Controller. Blocks shown include MAC RX (P/NP), MAC RX (CPL), MAC TX, Region Match, Tile Request Manager, Request Tracker, PCI PIO Completions, Pull DMA (PCI Reads), Push DMA (PCI Writes), scatter (SCTR) queues 0 through N, DMA pickers, descriptor fetch with software-posted ring pointers, PIO Buffer, PCI CPL Data Buf, Tile PIO Read/Write, Tile Messages, and Boot & Rshim.]

4.1.1 Communication and Data Transfer

Bulk data transfer between the I/O device and Tile software is through coherent shared memory reads and writes. Data is moved directly to and from caches to minimize the off-chip memory bandwidth requirement. Configuration and PIO traffic utilize MMIO. Interrupts are delivered to Tile software through the IPI mechanism.

4.1.2 PHY Sharing

On devices that support more than one PCIe port, each port can have its own PCIe interface hardware and DMA engines or the DMA functions can be shared between the ports. The PHY can also be shared between multiple interfaces in order to optimize utilization of SERDES lanes, but the software interface provides completely independent ports and any sharing of DMA resources is non-blocking between ports. See Figure 4-2.

[Figure 4-2: PHY/DMA Sharing Example. Three PCIe/StreamIO MACs (MAC0, MAC1, MAC2), each with its own SerDes lanes, share the TRIO (Transaction I/O) block, which connects to the Tiles.]

Implementation Note: The TILE-Gx36 device has 12 Gen2 SERDES lanes and three PCIe ports.
This allows configuration of one x8 link together with one x4 link, or of three x4 (or narrower) links.

4.2 MMIO Interface

Tile software communicates with the PCIe controller via loads and stores in MMIO space. The PCIe interface interprets the physical address in the MMIO loads and stores as shown in Figure 4-3 and Table 4-1.

[Figure 4-3: MMIO Address Mapping Format. The address is divided into a Region field in the upper bits and an Offset field in the lower bits.]

Table 4-1. MMIO Address Mapping Description
Bits     Name        Description
37:36    Reserved    Reserved
35:32    Region      Selects access to one of the following:
                     0       Config Space (port level is in offset 17:16)
                     1       Push DMA Posts
                     2       Pull DMA Posts
                     3       Scatter Queue FIFOs
                     8-15    PIO (Eight regions)
                     16      MAP MEM (Interrupt) Registers
31:0     Offset      Up to 4GB of offset within the region being accessed.

System software can use large pages to cover multiple regions with the same PTE, or smaller pages to provide limited access to PCIe structures and PIO space. The PIO, push-DMA, and pull-DMA regions are described in the sections that follow.

4.3 PIO Communication

The PCIe controller’s PIO interface provides direct PCI memory-space communication with the PCI port(s). This interface is typically used for lower bandwidth communication such as root-to-device configuration, DMA setup, and interrupts. Loads and stores of 1, 2, 4, and 8 bytes are translated into PCI config or memory space reads and writes, and then sent to the PCI system.

The PIO interface supplies eight independent PCI translation regions to map the Tile’s physical address into a PCI address. These are configured with the TRIO_TILE_PIO_REGION_SETUP registers. Each PIO region also has configuration register settings for the MAC it is associated with and the PCIe access type. The access type can be configured as memory, config, or I/O.
When operating as a memory PIO region, the translation region’s base address is appended as the MSBs to the low 32 bits of the Tile’s MMIO load/store address to form the PCI address. A request tracker is used to match incoming completions with outstanding MMIO loads. When operating as a config PIO region, the offset bits are interpreted as bus, device, and function number. I/O transactions use the low 32 bits of the MMIO address as the I/O address.

4.3.1 Memoryless Operation
In order to support operation without using any local memory controllers, the PCIe interface allows PIO regions to behave as memory controllers. Thus the main memory backing is provided by the PCIe subsystem rather than the local device’s memory controllers. System boot software must configure the Tiles’ MEM_MAP registers to point to the PCIe controller instead of a memory interface.

Memory reads and writes of 1, 2, 4, 8, or 64 bytes will be converted using the PIO_REGIONs into PCI reads and writes. The PIO_REGION is selected by the same physical address bits used for MMIO-based access and is defined in the TRIO_MMIO_ADDRESS_SPACE definition. The remaining address bits are converted to a PCI address as described above.

The strict order mode for the PIO region must be used for memoryless operations in order to preserve the Tile’s memory transaction order.

Memoryless mode is intended for very tightly controlled environments and as such introduces several restrictions. TRIO must not be allowed to perform any IO_READs to the Tile memory system if those reads might miss in the L3 cache, because the home tile will forward the request to memory (TRIO), which does not handle this type of access. This restriction means that the system cannot use PUSH DMA to move data up to the host unless it is guaranteed to hit L2.
The TILE-Gx cannot support ingress reads in memoryless mode, other than to the Rshim, unless they too are guaranteed to hit. Memoryless systems requiring communication with Tile software must use write-write messaging for communication and generally avoid host-to-tile reads as well as Tile-to-host DMAs.

Additionally, the SEND_COPY attribute is not supported in memoryless mode. Systems employing memoryless operations via TRIO must set each Tile’s XDN_ATTR_SEND_COPY_DISABLE bit in the CBOX_MSR SPR.

4.3.2 Ordering
MMIO loads and stores are converted into PCI reads and writes respectively. The ordering of the PCI transactions follows the PCI order model. This model allows writes to pass reads. Hence an MMIO store can complete prior to an MMIO load even if the store was issued by the Tile after the load and even if they are to the same address. System software utilizing PCI PIO must be written to operate correctly within the PCI order model.

An optional configuration setting in the TRIO_TILE_PIO_REGION_SETUP register allows the region to be configured as strict order. In this mode, no reads or writes will be sent to the PCI interface if a previous read (or config write) is outstanding. This mode ensures a strict order of transactions on the PCI bus, but sacrifices both read and write bandwidth.

Additionally, each PIO_REGION has a separate 32-entry FIFO to reduce head-of-line blocking between transactions targeting different devices/MACs. If Tile software requires ordering between two different PIO_REGIONs, it must issue a memory fence between the operations. A PIO write transaction will not be completed from the Tile’s perspective until it has been sent to the PCIe MAC. Thus a memory fence is sufficient to establish ordering between regions and a fencing read is not required.

4.4 Push DMA
Push DMA is used to move data from Tile memory space to PCI memory space with low Tile processing overhead.
The PCIe controller utilizes descriptor rings and a gather engine to collect data and move it to PCI.

4.4.1 Descriptors
Push-DMA transactions are described by descriptors. The descriptors are written into rings in the Tile’s memory space and are either posted to the DMA engine via MMIO or automatically read by the descriptor fetch engine in hunt mode.

The push DMA interface’s descriptor rings provide independent flows for QoS or application-based differentiation of flows. Each ring is processed in order by the push DMA engine, but ordering between different rings is not maintained. Any ring can be configured to move data to any MAC.

Implementation Note: The TILE-Gx36 device provides 32 Push-DMA descriptor rings.

The descriptors support gather functionality by allowing a PCI push transaction to be assembled from multiple Tile-side buffers. The descriptor format is shown in Figure 4-4.

Figure 4-4: Push/Pull-DMA Descriptor Format
Offset 0x00   VA[31:0]
Offset 0x04   Gen, Size, sMod, NTF, C, BSZ, VA[41:32]
Offset 0x08   PCI Address [31:0]
Offset 0x0C   PCI Address [63:32]

Table 4-2. Push/Pull-DMA Descriptions
Bits    Name   Description
31      Gen    Generation Number. Used to indicate valid descriptor in ring.
29:16   Size   Total number of bytes to move for this transaction. When sMod is equal to 1, this field is encoded (see below). When sMod=0 and Size=0, the transaction is a NOP. A Size-zero (NOP) descriptor with NTF=1 can be used to generate an interrupt when all older descriptors have completed. No read/write packets will be sent to the MAC and no Tile memory will be affected.
15      sMod   When 0, the Size field specifies the total byte count for the transaction. When 1, the Size field is encoded as 2^(N+14) for N in {0...6}: 0=16KB, 1=32KB, ..., 6=1MB. All other encodings of the Size field are reserved when sMod=1.
14      NTF    Signal interrupt for this ring when the transaction is complete.
13      C      Chaining Designation. Always 0 for pull DMA.
                 0   Unchained pointer.
                 1   Chained pointer. Next buffer descriptor (for example, VA) stored in first 8 bytes of the buffer.
               For chained buffers, the BSZ field is used to determine the size of the first buffer in the chain. Subsequent buffers are sized using the size field of the buffer descriptor.
12:10   BSZ    Buffer Size. Encoded size of the first buffer in the chain when C is equal to 1.
                 0   128 bytes
                 1   256 bytes
                 2   512 bytes
                 3   1024 bytes
                 4   1664 bytes
                 5   4096 bytes
                 6   10368 bytes
                 7   16384 bytes

4.4.2 Request Partitioning
The push-DMA engine gathers data for the transaction and partitions the PCIe write packets based on PCIe alignment and sizing rules. There are no alignment restrictions on the Tile-side or PCI-side addresses.

The MPS setting in the TRIO_PUSH_DMA_RG_INIT_DAT register determines the maximum packet size transmitted on the link. The DMA packet generator will also never cross an MPS-aligned PCI address boundary. This can result in an extra packet being sent on the link, but for all but the smallest transfers this effect is negligible.

4.4.3 Notification and Flow Control
As push-DMA transactions are processed, there are different types of notifications that software can require.

4.4.3.1 Descriptor Rings Slot Available Notification
An MMIO-readable head pointer allows Tile software to determine how much space is available in a DMA ring. This head pointer does NOT indicate that the associated descriptors have been completely processed, only that ring locations older than the head can be reused for new descriptors.

4.4.3.2 Transaction Complete Notification
The NTF bit in each descriptor can be used to generate an interrupt to Tile software upon completion of the DMA transfer.
A running count of descriptors processed for a given ring is also provided so that software can determine which descriptors have been completely processed.

4.4.3.3 PCI System Notification
Push DMA transfers typically involve writing bulk data to the PCI system and subsequently messaging the PCI system that the transaction is complete. This is done by putting an additional push descriptor, typically to the MSI location on the host, into the descriptor rings after the bulk data transfer descriptor. Hence no additional special-purpose hardware is required.

4.4.4 Flush/Fence
When an application crashes or a ring needs to be reconfigured, the flush/fence mechanism allows hardware resources to be reclaimed without impacting unrelated rings and flows.

The flush flow below must be used when a ring with the FLUSH_MODE bit of the TRIO_PUSH_DMA_RG_INIT_DAT_ASID register asserted takes a TLB fault. Attempting to restart such a ring without first flushing might cause packet corruption or hardware lockup.

The procedure for recovering a push DMA ring’s hardware resources is:
1. Set the ring’s freeze, stall, and flush bits in the TRIO_PUSH_DMA_DM_INIT_DAT_SETUP register to prevent additional descriptors from being fetched and processed. This will also flush already-fetched descriptors and buffer data.
2. Issue an MF to ensure that the register setting has completed.
3. Poll the FLUSH_PND bit of the TRIO_PUSH_DMA_CTL register until it is clear to ensure that the descriptor flush has completed.
4. Set the FENCE bit of the TRIO_PUSH_DMA_CTL register to initiate a coherence fence on outstanding push DMA data reads.
5. Poll the FENCE bit of the TRIO_PUSH_DMA_CTL register to ensure that all outstanding requests have completed.
6. MF.
7. Poll the FENCE bit of the TRIO_PUSH_DMA_CTL register until it is clear.
8. Poll the FLUSH_PND bit of the TRIO_PUSH_DMA_CTL register to ensure that the buffer flush has completed.
9. Software can now use the COUNT bit of the TRIO_PUSH_DMA_REGION_VAL register to determine which descriptors have been processed by hardware and which were not.
10. Write the TRIO_PUSH_DMA_DM_INIT_DAT_HEAD register to zero for the associated ring.
11. Write TRIO_PUSH_DMA_DM_INIT_DAT_DESC_STATE0 to 0x1 and TRIO_PUSH_DMA_DM_INIT_DAT_DESC_STATE1 to 0x0. This places the descriptor ring fetch into its initial state.
12. Read the Interrupt Status register (TRIO_INT_VEC*_RTC) from the bound Tile to flush any outstanding interrupts.
13. MF.

4.5 Pull-DMA
The Pull-DMA engine is used to move data from the PCI system to the Tile memory system. Similar to push-DMA, the pull-DMA utilizes descriptor rings to manage transactions (Figure 4-5).

Figure 4-5: Pull DMA (descriptor fetch and picker over rings 0 to N feed request partitioning; PCI reads are issued through MAC TX and tracked by the request tracker; completions arriving on MAC RX (CPL) are written as PCI completion data to Tile memory)

The descriptor format is identical to push-DMA and shown in Figure 4-4; however, only a single Tile-side buffer descriptor is supported for each transaction. Hence software must post multiple pull-DMA descriptors to perform a scatter operation.

Implementation Note: The TILE-Gx36 device provides 32 Push and 32 Pull-DMA descriptor rings.

For each PCI read request that is generated, the Tile-side physical address is calculated using the technique described in 4.7 Address Translation. If a translation fault occurs, the associated DMA ring is frozen until software installs a proper translation. PCI read transactions prior to the translation fault will be completely processed.

4.5.1 Pull DMA Notifications and Flow Control
As pull DMA transactions are processed, there are several types of notifications required.
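Because pull-DMA and push-DMA share the descriptor format of Figure 4-4, including the NTF bit used for the completion notification, a descriptor can be modeled as four 32-bit words. The following is a minimal sketch, not a driver: the function names are hypothetical, the word order (VA[31:0] at 0x00, the control word at 0x04) is read from Figure 4-4, VA[41:32] is assumed to occupy bits 9:0 of the control word, and little-endian byte order in memory is an assumption.

```python
import struct

# Encoded first-buffer sizes for the BSZ field (Table 4-2); index = encoding.
BSZ_BYTES = (128, 256, 512, 1024, 1664, 4096, 10368, 16384)


def encode_size(nbytes):
    """Return (sMod, Size) for a transfer of nbytes, following Table 4-2."""
    if 0 <= nbytes < (1 << 14):
        return 0, nbytes                 # direct byte count; 0 means NOP
    n = nbytes.bit_length() - 15         # candidate N for 2^(N+14)
    if 0 <= n <= 6 and nbytes == 1 << (n + 14):
        return 1, n                      # encoded sizes 16KB ... 1MB
    raise ValueError("size is not encodable in one descriptor")


def pack_descriptor(va, pci_addr, nbytes, gen, ntf=False, chained=False, bsz=0):
    """Build a 16-byte descriptor (layout assumptions are in the lead-in)."""
    if not 0 <= va < (1 << 42):
        raise ValueError("VA must fit in 42 bits")
    if not 0 <= pci_addr < (1 << 64):
        raise ValueError("PCI address must fit in 64 bits")
    if not 0 <= bsz < len(BSZ_BYTES):
        raise ValueError("BSZ is a 3-bit encoding")
    smod, size = encode_size(nbytes)
    control = ((gen & 1) << 31 | size << 16 | smod << 15 | int(ntf) << 14
               | int(chained) << 13 | bsz << 10 | (va >> 32))
    return struct.pack("<4I", va & 0xFFFF_FFFF, control,
                       pci_addr & 0xFFFF_FFFF, pci_addr >> 32)
```

For a pull-DMA descriptor, `chained` stays False, since Table 4-2 states that C is always 0 for pull DMA. Sizes that are neither below 16KB nor one of the encoded powers of two would have to be split across several descriptors by the caller.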
4.5.2 Descriptor Rings Slot Available Notification
The descriptor rings are maintained identically to the push DMA descriptor rings.

4.5.3 Transaction Complete Notification
Once the pull-DMA transfer has completed and the data is visible to Tile software, an interrupt can be optionally delivered via the IPI mechanism by setting the NTF bit in the descriptor. Software can also poll for transaction completion information by reading the associated TRIO_PULL_DMA_REGION_VAL.COUNT register.

4.5.4 Request Tracker
The Pull-DMA engine partitions the transaction specified in the descriptor into legal PCIe requests based on PCIe alignment rules and max-request-size limitations. Each PCIe read request packet is tracked using a hardware request tracker. As completions are returned on the PCIe port, the request tracker entry provides the Tile-side memory address to which the completion data is returned.

The request tracker is used to detect a number of exception cases including unexpected completions and request timeouts. The request tracker state can be cleared by software or when the DL_Down state is entered on the PCIe link.

4.6 Flush/Fence
When an application crashes or a ring needs to be reconfigured, the flush mechanism allows hardware resources to be reclaimed without impacting unrelated rings and flows.

The procedure for recovering a pull DMA ring’s hardware resources is:
1. Set the ring’s freeze, flush, and stall bits in TRIO_PULL_DMA_DM_INIT_DAT_SETUP to prevent additional descriptors from being fetched and processed. Note that if the DMA engine detects an error [1], the ring will automatically be frozen and stalled, so only the flush bit will need to be asserted (without clearing the freeze and stall bits).
2. Issue an MF to ensure that the register setting has completed.
3. Poll the FLUSH_PND bit of the TRIO_PULL_DMA_CTL register to ensure that all outstanding requests have completed.
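The three-step pull-DMA flush procedure above can be sketched against an abstract register interface. Everything named here is hypothetical: `FakeTrio` is a test double that stands in for real MMIO loads and stores, its method names are not part of any Tilera API, and the sketch assumes that FLUSH_PND reads as clear once the flush has completed (as stated for the push-DMA flow).

```python
class FakeTrio:
    """Test double for TRIO register access so the sketch runs anywhere."""

    def __init__(self, polls_until_idle=3):
        self.log = []
        self._polls_left = polls_until_idle

    def set_pull_ring_setup(self, ring, freeze, stall, flush):
        # Real code would write TRIO_PULL_DMA_DM_INIT_DAT_SETUP for the ring.
        self.log.append(("setup", ring, freeze, stall, flush))

    def memory_fence(self):
        # Real code would issue an MF instruction.
        self.log.append(("mf",))

    def pull_flush_pending(self):
        # Real code would read FLUSH_PND from TRIO_PULL_DMA_CTL.
        self.log.append(("poll",))
        if self._polls_left > 0:
            self._polls_left -= 1
            return True
        return False


def flush_pull_ring(trio, ring, max_polls=1000):
    """Run steps 1-3; return True if the flush completed within max_polls."""
    # Step 1: freeze, stall, and flush. If hardware already froze and stalled
    # the ring after an error, writing those bits as set leaves them unchanged.
    trio.set_pull_ring_setup(ring, freeze=True, stall=True, flush=True)
    # Step 2: make sure the register write has completed.
    trio.memory_fence()
    # Step 3: poll until the flush is no longer pending.
    for _ in range(max_polls):
        if not trio.pull_flush_pending():
            return True
    return False
```

Bounding the poll loop is a design choice of this sketch rather than something the procedure requires; it lets recovery code report a ring that never drains instead of spinning forever.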
4.7 Address Translation
Push and pull DMA descriptors provide data pointers in virtual address space. These are translated to physical addresses utilizing the entries in the shared TLB. Each ring is associated with a 4-bit ASID. The ASID provides the context in which to evaluate the VA.

[1] For example, a TLB fault on a ring whose FLUSH_MODE bit in the TRIO_PULL_DMA_RG_INIT_DAT register is asserted.

If the translation for a descriptor fails, the associated ring is frozen and an interrupt is sent via the IPI mechanism. Software must install a translation and re-enable the ring to continue processing.

PCIe addresses are similarly translated using the same shared I/O TLB. Each ingress mapping region has an associated ASID and BASE_VA. This is used to form the VA provided to the I/O TLB.

The shared I/O TLB is managed by system software and has the following properties:
• There is one central I/O TLB shared between Push-DMA reads, Pull-DMA writes, and MAP reads/writes.
• There are 16 TLB entries per ASID and 16 ASIDs.
• There is a dedicated interrupt binding for Push-DMA, one for Pull-DMA, and one for MAP-MEM.
• Push-DMA fault data is captured in TRIO_TLB_PULL_DMA_EXC register.
• Push-DMA faults cause the associated descriptor to be retried until the fault is handled. Thus subsequent interrupts for the Push-DMA binding can occur and there will be a slight drop in Push-DMA performance while the miss is being handled.
• Alternatively, each push-DMA ring can be put into a drop-on-fault mode via the FLUSH_MODE bit in the associated TRIO_PUSH_DMA_RG_INIT_DAT register. When a push DMA descriptor is discarded, the PUSH_DESC_DISC interrupt will be triggered.
• Pull-DMA fault data is captured in TRIO_TLB_PULL_DMA_EXC.
• Pull-DMA faults will cause pull DMA writes to stall, thus only one Pull-DMA fault will occur at a time. If the Pull-DMA fault is not handled in a timely fashion, the Pull-DMA engine will stop issuing new PCIe read transactions for the associated MAC and Pull-DMA performance will drop.
• Alternatively, each pull-DMA ring can be put into a drop-on-fault mode via the associated FLUSH_MODE bit of the TRIO_PULL_DMA_RG_INIT_DAT register. When data is dropped in this mode, a PULL_DATA_DISC interrupt will be triggered.
• MAP reads and writes each have their own TLB fault interrupt and associated TRIO_TLB_MAP_WR_EXC/TRIO_TLB_MAP_RD_EXC register.
• MAP faults will cause the subsequent reads/writes to stall. This could cause the PCIe requesting agent to hit a read timeout or violate timeliness rules for posted transactions. This can also cause deadlock in systems where PIO is being used. Hence most systems will not use the fault-in flow on MAP accesses but rather use fixed TLB mappings to provide windows into PA space. The I/O MMU described in the following section is more appropriate for dynamic page mappings.

4.7.1 I/O MMU
Incoming PCIe requests that target a MAP MEM region use the I/O MMU table if the region’s USE_MMU bit is set. The MMU provides many more translations than the I/O TLB and is generally used to map 32-bit I/O addresses into PA space above 4GB or to provide custom caching attributes (homing) for specific transactions and devices.

Implementation Note: The TILE-Gx36 provides 4096 MMU table entries.

The I/O MMU does not provide a fault-in capability; hence operations targeting the MMU must coordinate beforehand to allocate and configure entries.

The index into the MMU table is formed by taking N bits of the VA starting at the MMU page size.
N is based on the size of the table (for example, 12 bits for a 4K table). This base index is then added to (ASID<<M), where M is Log2TableSize - Log2NumASIDs. This allows the table to be partitioned into NumASIDs sub-tables. Thus the BASE/LIM pair and ASID allow each MAP region to be associated with a portion of the MMU table and allow the MMU table to be shared/protected between different MAP MEM regions.

Each MMU table entry provides a PA, homing attributes, and a valid bit. If a request targets an MMU entry with the valid bit clear, the request will be redirected to the PANIC_PA and an MMU_ERROR interrupt will be triggered.

4.8 Ingress Mapping Regions
PCIe requests arriving on the link are compared to the BAR or BAL registers in PCI configuration space to determine if the request properly targets the device. Packets whose addresses fall within the device’s range are then compared to “mapping regions” to determine what action is taken with the packet and its data.

A mapping region consists of a 4KB-aligned base and limit within PCI address space and region-specific attributes described below. Requests that fail to match any regions will be shunted to the PANIC_PA stored in the TRIO_PANIC_MODE_CTL register. This also will trigger the MAP_UNCLAIMED interrupt.

Requests that target the PANIC_PA prior to system software configuring the tile (for example, prior to boot) will yield unpredictable system behavior. System designers must coordinate memory-mapped communication to the TILE-Gx such that PANIC_PA requests are not generated prior to TILE-Gx boot software preparing memory space and exiting cache-as-RAM mode.
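The region-matching behavior described above can be illustrated with a small software model. This is a sketch under stated assumptions: the class and function names are hypothetical, the limit is treated as an exclusive end address (this section does not state the hardware convention), and the first matching region wins.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # base and limit are 4KB aligned


@dataclass(frozen=True)
class MapRegion:
    name: str
    base: int    # first PCI address covered by the region
    limit: int   # assumed here to be the exclusive end address

    def __post_init__(self):
        if self.base % PAGE_SIZE or self.limit % PAGE_SIZE:
            raise ValueError("base and limit must be 4KB aligned")
        if self.limit <= self.base:
            raise ValueError("limit must be above base")

    def contains(self, pci_addr):
        return self.base <= pci_addr < self.limit


def route_request(pci_addr, regions, panic_pa):
    """Return ('region', name) for a match, else the panic redirection."""
    for region in regions:
        if region.contains(pci_addr):
            return ("region", region.name)
    # No region claimed the request: it is shunted to PANIC_PA and the
    # MAP_UNCLAIMED interrupt is raised.
    return ("panic", panic_pa, "MAP_UNCLAIMED")
```

A host-side tool could use a model like this to check that every address it intends to touch falls inside a configured region before TILE-Gx boot software has prepared memory, which is the situation the paragraph above warns about.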
Figure 4-6: PCI Region Example (PCI address space containing the BAR (endpoint) or !BAL (root complex) range, with MemRegions mapped to Tile memory regions (VA, ASID) and a Boot/Rshim region; a PCI read becomes Tile memory read(s), and the Tile response(s) become PCI completion(s))

4.8.1 Tile Map Memory Regions
Tile memory regions provide a mapping between PCI addresses and Tile memory addresses. Incoming read and write PCI requests that match a memory region are converted to Tile memory reads and writes.

Implementation Note: The TILE-Gx36 provides 16 Tile Memory Regions.

Each Tile memory region contains a 4KB-aligned Tile-side base VA and ASID in addition to the PCI address base and limit. The VA for a request is calculated as (IO_ADDRESS - BASE) + BASE_VA. This VA is processed through the I/O TLB or I/O MMU, based on the region’s USE_MMU bit, to produce a Tile PA and associated attributes, such as hash-for-home and caching attributes.

To prevent system-level deadlock, the I/O TLB fault mechanism is not typically used to fault in translations dynamically. Instead, the I/O MMU table is used when the system requires translations to be updated dynamically.

A single PCI request might need to be partitioned into multiple Tile memory transactions. The controller tracks outstanding Tile memory reads in order to form proper PCIe completions.

4.8.1.1 MAP-MEM Interrupts
Each map-memory region has a set of 16 general-purpose interrupt bits. These bits are accessible both from the Tile side and from the PCI Express interface. The bits can be configured to trigger Tile-side interrupts.

Each map-memory region allows its interrupt vector to be configured to dispatch the associated Tile-side interrupt based on level or edge semantics.

The interrupt vector itself can be accessed from the PCI Express or Tile MMIO interfaces via one of four different registers.
Each register has unique access semantics as described below:

Table 4-3. Register Behaviors
Register 0 (R/W)
  Read:  Returns current value.
  Write: Writes a new value.
Register 1 (RC/W1TC)
  Read:  Returns current value, clears all bits.
  Write: 1 clears the bit. 0 leaves the bit intact.
Register 2 (R/W1TS)
  Read:  Returns current value.
  Write: 1 sets the bit. 0 leaves the bit intact.
Register 3 (R/SetBit)
  Read:  Returns current value.
  Write: Sets the bit indexed by the data value (for example, data value indicates which bit is to be set).
Registers 4-7
  Read:  Exhibits the same behavior as registers 0-3, but without any “edge” interrupts.
  Write: Exhibits the same behavior as registers 0-3, but without any “edge” interrupts.

From the Tile side, these registers are accessible via the MAP region within MMIO space. From PCI Express, these registers appear as the first 64 bytes of the associated map-memory region. Each register occupies 8 bytes of address space. Registers 4-7 behave the same as the associated register in locations 0-3, but they do not generate any edge interrupts.

When the INT_ENA bit is set for the associated MAP MEM region, the region’s address space is formatted as shown in Figure 4-7.

Figure 4-7: MAP Region within MMIO Space
MAP_MEM_BASE    to MAP_MEM_BASE+7    MAP_MEM_INT0 Register
MAP_MEM_BASE+8  to MAP_MEM_BASE+15   MAP_MEM_INT1 Register
MAP_MEM_BASE+16 to MAP_MEM_BASE+23   MAP_MEM_INT2 Register
MAP_MEM_BASE+24 to MAP_MEM_BASE+31   MAP_MEM_INT3 Register
MAP_MEM_BASE+32 to MAP_MEM_BASE+63   Same as MAP_MEM_INT0-3, except no edge-interrupt generation
MAP_MEM_BASE+64 to MAP_MEM_LIM       Normal MAP-MEM address space mapped to Tile memory space

4.8.1.2 Map-Region Ordering
Each map-memory region can be configured into one of three modes via the ORDER_MODE field in the associated TRIO_MAP_MEM_SETUP register:
• UNORDERED: Writes to different cachelines are not ordered with respect to each other. Reads will never complete until all older writes in all mapping regions have become visible to Tile software.
• STRICT: Writes and reads are strictly ordered, even to different cachelines and across different mapping regions (including Scatter Queue (SQ) regions). This mode might result in decreased write performance.
• REL_ORD: Write ordering is enforced if the incoming packet’s relaxed-ordering attribute is clear. If the packet’s relaxed-ordering bit is set, the writes are unordered. Reads will never complete until all older writes in all mapping regions have become visible to Tile software.

The interrupt registers are updated using the STRICT order model defined above. That means all previous write data, even to UNORDERED regions, will be visible to Tile software prior to the interrupt state registers being updated. The Tile-side interrupt will also follow the STRICT order model and be triggered at the point the interrupt register write is made visible.

4.8.2 Scatter Queue Regions
Scatter queue (SQ) regions provide a means for mailbox/doorbell communication via PCI Express. Each scatter region consists of a descriptor FIFO used to provide target VAs for incoming PCIe requests. Both reads and writes are supported to SQ regions.
Each time a PCIe write or read is received within a scatter region, its Tile-side VA is calculated by adding the offset within the region to the VA provided by the descriptor at the head of the descriptor FIFO.

Implementation Note: The TILE-Gx36 provides eight SQ Regions.

A single write-only 8-byte register in the last 8 bytes of the region provides the ability to dequeue the descriptor and/or generate an interrupt (doorbell). The register format is described in TRIO_MAP_SQ_DOORBELL_FMT in the register specification and shown in Table 4-4.

Table 4-4. TRIO_MAP_SQ_DOORBELL_FMT Register Bit Descriptions
Bits   Name       Type   Reset   Description
1      POP        WO     0       When written with a 1, the descriptor at the head of the associated MAP_SQ's FIFO will be dequeued.
0      DOORBELL   WO     0       When written with a 1, the associated MAP_SQ region's doorbell interrupt will be triggered once all previous writes are visible to Tile software.

Writes to the TRIO_MAP_SQ_DOORBELL_FMT register must be 8 bytes or less (for example, not combined with a write to the data portion of the base and limit region). Writes smaller than 8 bytes to the upper 7 bytes of the TRIO_MAP_SQ_DOORBELL_FMT register have undefined behavior. Larger writes that happen to overlap the TRIO_MAP_SQ_DOORBELL_FMT register will be written to Tile memory and will not access the register. Reads to the TRIO_MAP_SQ_DOORBELL_FMT register will also access Tile memory and have no impact on the register.

The TRIO_INT_VEC3_W1TC vector register contains the interrupts associated with the map SQ regions. These interrupts are referred to as “doorbell interrupts”. The lower eight bits contain information about the associated region’s doorbell interrupt. The next eight bits are the associated region’s descriptor-dequeue interrupt. The TRIO_INT_VEC3_W1TC register is paired with the TRIO_INT_VEC3_RTC register, which provides the mechanism for clearing the associated doorbell interrupts when it is read.

The scatter queue descriptor format is described in Table 4-5.
Table 4-5. TRIO_MAP_SQ_REGION_WRITE_VAL Bit Descriptions
Bits    Name       Type   Reset   Description
63      INT_ENA    WO     0       Indicates that an interrupt is requested when this descriptor is dequeued.
62:42   Reserved                  Reserved
41:12   VA         WO     0       4KB-aligned VA to be used on incoming MAP_SQ writes. The VA for an incoming write will be IO_ADDRESS - MAP_SQ_BASE + VA.
11:0    Reserved                  Reserved

Tile software is responsible for keeping the 64-entry descriptor FIFO full. The ‘I’ (interrupt) bit can be used for flow control so that software knows when more descriptors are needed. If more than 64 descriptors are written into the descriptor FIFO, it might drop descriptors and trigger the MAP_SQ_OVFL interrupt.

If an incoming PCIe transaction falls within an SQ_REGION’s base and limit but there are no valid descriptors in the FIFO, the transaction will be directed to the PANIC_PA bitfield stored in the TRIO_PANIC_MODE_CTL register. This will also trigger the MAP_SQ_EMPTY interrupt.

Accesses to the write-only DOORBELL bitfield of the TRIO_MAP_SQ_DOORBELL_FMT register are only allowed if there is a valid descriptor in the associated FIFO. If a DOORBELL write arrives from the PCIe and there is no valid descriptor, the access will be directed to the PANIC_PA bitfield, as described above.

As with MAP-Memory regions, SQ_REGIONs provide UNORDERED, STRICT, and REL_ORD modes as described above. The doorbell interrupt is always delivered as a strict-order operation. In other words, the interrupt will not be triggered until all older writes are visible to Tile software.

4.8.3 Boot and Rshim Regions
Rshim access is provided through a dedicated mapping region. The Rshim region consists of 1MB of PCI address space mapped to the Rshim’s address space. The address is interpreted as {channel,offset}. For more information, refer to the TRIO_MAP_RSH_ADDR_FMT register.
The Rshim region is mapped into PCIe BAR-0 automatically on hardware reset. Additionally, if the port is enabled for root complex (RC) operation, the 1MB region will be enabled in the low address range of the negatively decoded base and limit region. Thus the Tile processor can be booted from either a root complex or endpoint device.

The Rshim consists of 64-bit registers. In order to provide compatibility with 32-bit (dword) PCIe accesses, the Rshim contains a set of registers for mapping the 64-bit register accesses into indirect 32-bit accesses.

Boot is achieved by writing the Rshim’s packet generator interface through the Rshim map region. Software can implement flow control to prevent boot transactions from backing up onto the PCIe bus. This flow control consists of polling the Rshim’s packet generator data-words-sent counter to see how much data has been sent. The interface can sink up to 4KB of boot data without back-pressuring the MAC, so the flow control only needs to be done once for every 4KB of data. One example software algorithm would be to send 4KB of data, then poll until 2KB has been sent, then send another 2KB, and so on.

A separate FIFO provides a path for non-boot transactions to the Rshim. Thus, as long as software does not send more than 4KB of boot data, read and write accesses to Rshim registers will complete without blocking.

Only 1-, 2-, 4-, and 8-byte writes and reads are supported to the Rshim region. A read request for more than 8 bytes will result in a completion with an UnsupportedRequest status. Writes larger than 8 bytes will only complete the first 8 bytes written. The remaining bytes will be dropped.

4.8.4 Map Fence
When a MAP-MEM/SQ region needs to be reconfigured, the MAP-Mem Fence mechanism can be used to guarantee that all outstanding transactions have completed.
To reconfigure a MAP-MEM or SQ region, the following procedure should be used:
1. Disable the region by clearing its MAC_ENA bits in the TRIO_MAP_MEM_SETUP or TRIO_MAP_SQ_SETUP register.
2. Issue an MF to ensure that the register write has completed.
3. Set the FENCE bit in TRIO_MAP_MEM_CTL.
4. Poll the FENCE bit until it clears.
5. Poll the RDQ bitfield of the TRIO_MAP_DIAG_FSM_STATE register until it reads zero to ensure all reads have issued to Tile memory space.
   Software can now be sure that no new transactions will arrive from the associated region, no TLB misses will occur from the region, and all older writes will have completed.
6. If software needs to ensure that all reads have completed (for example, all data fetched from Tile memory), the FENCE bit of the TRIO_PUSH_DMA_CTL register must also be written and polled at this time.

4.9 Panic Mode
Since Tile-side software is required to fill I/O TLB translations to allow forward progress of MAP-MEM/SQ transactions, it is possible for the PCIe bus to become clogged with transactions if the Tile-side software has crashed or otherwise stopped responding to TLB fill requests. In order to allow system recovery without crashing the PCIe host system, an optional timer configured in the TRIO_PANIC_MODE_CTL register detects when TRIO is preventing forward progress on MAC transactions. When the panic timer fires, any pending TLB misses are aborted and all MAP-MEM/SQ transactions that miss in the TLB are instead shunted to the pre-configured physical address stored in PANIC_PA. This guarantees forward progress so that system software can access Rshim registers for debug and reset.

4.10 Connection to mPIPE
It is sometimes desirable to treat data from the PCI system as “packet” data and pass it through the TILE-Gx processor’s mPIPE™.
This allows, for example, packets collected by a host-connected Network Interface Card (NIC) to be off-loaded to TILE-Gx for processing as in Figure 4-8.

[Figure 4-8: Host Offload Model. Blocks shown: network, PCIe NIC, host chipset, host processor, and host DRAM; TILE-Gx PCIe network offload card with application, Pull DMA driver, PDE, and DDR3.]

To support distributing data to the mPIPE, the TILE-Gx processor provides dedicated eDMA gather/loopback channels that allow data from Tile memory space to be processed through the classification, load-balance, distribution, and notification services in the mPIPE’s iDMA path. This allows data to be collected into buffers by the PCIe controller’s Pull DMA function in the offload model shown above.

Alternatively, if the TILE-Gx processor is the root complex with an attached NIC, the NIC’s driver running on the TILE-Gx processor will collect packet data into Tile memory, likely using the NIC’s push DMA. From Tile memory, the data can be sent through the mPIPE using the eDMA gather/loopback function (Figure 4-9).

[Figure 4-9: Hosted-NIC Model. Blocks shown: network, PCIe NIC, and a TILE-Gx PCIe network host/appliance with application, NIC driver, PDE, and DDR3.]

4.11 Deadlock

PCIe ordering rules are defined in the PCI Express® Base 2.0 specification in Table 2-24. These rules enforce PCI’s producer-consumer memory model and prevent deadlock. These rules include the following for deadlock avoidance:

• Non-Posted requests never block posted requests. The PIO buffer, for example, will bypass stalled Non-Posted requests so that posted requests make progress.
• Non-Posted requests never block completions.
Packets arriving from the PCIe port are mapped into the Tile memory system as follows:

Packet Type        Region Type           Dependencies
Memory Read (NP)   Tile Memory / Rshim   TileMem (read/resp), PCI Completion
                   NoMatch               Tile SW, PCI Completion
Memory Write (P)   Tile Memory / Rshim   TileMem (write)
                   NoMatch               Tile SW
Completion (CPL)   (Request Tracker)     TileMem (write), IDN (interrupt)

Because the Tile memory system is deadlock free and always drains, the dependencies are on Tile SW and PCI completions. PCI completions always drain, so Tile SW is the only true dependency that must be managed.

In order to support Tile offload models that might introduce dependencies between non-posted and posted/completion transactions, the controller provides an optional non-posted ingress credit counter. This counter decrements each time a non-posted packet is sent via the software region. When the counter is zero, non-posted packets will not be dequeued from the MAC, and posted/completion traffic will continue to make progress. The counter can be incremented or written by software.

In order to improve the performance of posted/completion flows in the presence of a congested PCI completion flow, the controller allows posted/completion traffic received on the PCIe link to make progress even if a full PCI completion buffer is blocking incoming non-posted traffic. This is not strictly required for deadlock-free operation, but allows more deterministic posted write performance.

CHAPTER 5 PCIe MAC INTERFACE

5.1 Introduction

TILE-Gx™ PCIe interfaces are connected to TRIO to integrate the off-chip PCIe subsystem with TILE-Gx’s memory system. This is shown in Figure 5-1.
[Figure 5-1: PCIe I/O Interface Subsystem. Three groups of SerDes lanes feed three PCIe/StreamIO MACs (MAC0, MAC1, MAC2), which connect to TRIO (Transaction I/O) and the Tiles.]

Implementation Note: The TILE-Gx36™ device contains three PCIe ports that connect to a single TRIO instance.

Each PCIe interface provides the following features:

• PCIe Gen-2 support (5Gbps per lane)
• Each port can be configured as endpoint or host
• Each port can be replaced by a StreamIO instance for lightweight FPGA connections
• TILE-Gx is bootable via endpoint, root-complex, or StreamIO
• Auto-negotiated link width
• Lane/Polarity reversal
• Max Payload Size of up to 1024 bytes for efficient data transfer
• Nonblocking diagnostics access via endpoint, root, or StreamIO
• SR-IOV support
• MSI/MSI-X plus legacy interrupt support
• Support for mPIPE™ buffer descriptors for zero-copy packet-to-Tile-to-PCIe transfers
• Compliant with the PCI Express® Base 2.0 specification
• Programmable device capabilities including BAR sizes
• Support for OEM of PCIe device via writable Vendor/DeviceIDs and other capability structures
• Multiple simultaneous data movement models including push DMA, pull DMA, PIO, memory-mapped, and mailbox/doorbell with no Tile software overhead
• Low power and power-down modes supported in both the PHY and the MAC including active state (dynamic) power management, L1/L2 power down, and beacon/wake
• Crosslink support to allow identically configured endpoint or root ports to be interconnected
• Advanced error reporting, function-level-reset, and vendor messaging capabilities

5.2 Register Spaces

PCIe registers are accessible via MMIO space. See the TRIO_CFG_REGION_ADDR register specification.
The physical address provided to TRIO is interpreted as follows to provide access to MAC/MAC-interface registers:

[Figure 5-2: TRIO_CFG_REGION_ADDR Register]

Table 5-1. TRIO_CFG_REGION_ADDR Register Descriptions

Bits    Name      Type   Reset   Description
36:32   REGION    RW     0       Selects CFG_SPACE.
21:20   PROT      RW     0       Unused for MAC address space.
19:18   MAC_SEL   RW     0       Selects the MAC being accessed.
17:16   INTFC     RW     0       Interface being accessed (see the values below).
15:0    REG       RW     0       Configuration register to be accessed. Note that TRIO and MAC_INTERFACE registers are always aligned on 8-byte boundaries and access is always 8 bytes at a time. MAC registers are 4-byte oriented.

INTFC values:

Value   Name            Meaning
0       TRIO            Access to centralized TRIO registers. 8-byte oriented registers.
1       MAC_INTERFACE   Access to per-MAC interface control registers (interrupts, serdes control, etc.). 8-byte oriented registers.
2       MAC_STANDARD    Access to per-MAC registers (PCIe config space, etc.). This interface is typically only used by BIOS/discovery software since it treats BAR registers as read only and thus prevents BAR resizing. 4-byte oriented registers. Also supports 1- and 2-byte operations. The upper 4 bytes of an 8-byte store will be discarded. The upper 4 bytes of an 8-byte load will be zeroed.
3       MAC_PROTECTED   Access to per-MAC registers (allows writing of BAR registers). 4-byte oriented registers. Also supports 1- and 2-byte operations. The upper 4 bytes of an 8-byte store will be discarded. The upper 4 bytes of an 8-byte load will be zeroed.

5.2.1 Type-0/1 and Virtual Function Configuration Space

Access to Type-0 (endpoint) and Type-1 (root complex) configuration space is provided within the MAC_STANDARD and MAC_PROTECTED address spaces.
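The field layout in Table 5-1 can be illustrated with a small helper that packs and unpacks the address fields. This is a minimal sketch, not Tilera code: the function names `cfg_region_addr` and `decode_cfg_region_addr` are hypothetical, the REGION value that selects CFG_SPACE is not given in this section and is therefore left as a caller-supplied parameter, and the alignment check simply encodes the statement that TRIO and MAC_INTERFACE registers are 8-byte aligned.

```python
from enum import IntEnum


class Intfc(IntEnum):
    """INTFC field encodings listed in Table 5-1."""
    TRIO = 0
    MAC_INTERFACE = 1
    MAC_STANDARD = 2
    MAC_PROTECTED = 3


# Field name -> (lowest bit, width in bits), taken from Table 5-1.
FIELDS = {
    "region": (32, 5),
    "prot": (20, 2),
    "mac_sel": (18, 2),
    "intfc": (16, 2),
    "reg": (0, 16),
}


def cfg_region_addr(region, mac_sel, intfc, reg, prot=0):
    """Pack the Table 5-1 fields into an address (hypothetical helper)."""
    values = {"region": region, "prot": prot, "mac_sel": mac_sel,
              "intfc": int(intfc), "reg": reg}
    # TRIO and MAC_INTERFACE registers are always 8-byte aligned.
    if intfc in (Intfc.TRIO, Intfc.MAC_INTERFACE) and reg % 8:
        raise ValueError("TRIO/MAC_INTERFACE registers are 8-byte aligned")
    addr = 0
    for name, (low, width) in FIELDS.items():
        if not 0 <= values[name] < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        addr |= values[name] << low
    return addr


def decode_cfg_region_addr(addr):
    """Split an address back into the Table 5-1 fields."""
    fields = {name: (addr >> low) & ((1 << width) - 1)
              for name, (low, width) in FIELDS.items()}
    fields["intfc"] = Intfc(fields["intfc"])
    return fields
```

For example, with REGION assumed to be 0, `hex(cfg_region_addr(region=0, mac_sel=1, intfc=Intfc.MAC_STANDARD, reg=0x10))` evaluates to `'0x60010'`: MAC_SEL contributes bit 18, INTFC contributes bit 17, and REG occupies the low 16 bits.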
To access the Type-0 config space for virtual functions (SR-IOV), the TRIO_PCIE_INTFC_VF_ACCESS register must be programmed with the target virtual function number.

5.3 Port Configuration

Each PCIe port may be configured via either strapping pins or MMIO registers. See the TRIO_PCIE_INTFC_PORT_CONFIG register for strapping and software port configurations.

If the port is enabled via strapping pins, the port will train automatically without software running on the Tile. However, if the port is to be trained after software boot, the TRIO_PCIE_INTFC_PORT_CONFIG register can be used to select the device type (StreamIO vs. PCIe-Root vs. PCIe-Endpoint) as well as other port-specific settings.

Once a port is enabled, it will automatically train to the widest possible link width and fastest possible link speed. Polarity and lane reversal will also be auto-negotiated.

5.4 IO Address Mapping

The 64-bit PCIe address space is separate from the Tile physical address space. Address translation is used to provide protection of both the PCIe and the Tile address spaces. The addresses of PCIe requests arriving from the MAC are translated to Tile physical addresses using the map regions and IO TLB / IO MMU mechanisms described in Section 4.8 Ingress Mapping Regions.

5.4.1 Boot and Diagnostics Access

Access to TRIO’s boot and Rshim region (see Section 4.8.3 Boot and Rshim Regions) is typically through the low 1MB of BAR0 (endpoint) or low 1MB of the PCIe address space (root). The Rshim/boot region is relocatable at runtime.

In order to allow boot and debug access regardless of the BAR0 offset, incoming BAR0 accesses have their upper address bits masked. This causes the same (low) address bits to be passed to TRIO regardless of where system software locates the TILE-Gx BAR0 within the PCIe address space.
The masking function is programmable via the TRIO_PCIE_INTFC_RX_BAR0_ADDR_MASK register.

5.5 Interrupts

The PCIe interface supports legacy, MSI, and MSI-X interrupt mechanisms as well as interrupt signaling through the MAP-MEM regions in TRIO.

As a PCIe endpoint, MSI/X and legacy interrupts may be dispatched to the root complex via the TRIO_PCIE_INTFC_EP_INT_GEN register.

As a PCIe root complex, legacy interrupts from devices are reflected as INT_LEVEL/INT_DEASSERT/INT_ASSERT interrupts in the TRIO_PCIE_INTFC_MAC_INT_STS interrupt status register. MSI/X interrupts to a root complex port arrive as writes that may be mapped by system software into the MAP-MEM interrupt registers.

StreamIO may also use the MAP-MEM interrupt registers. TRIO’s MAP-SQ doorbell interrupts may also be used for application-level interrupts.

5.6 Power Management

PCIe power management support is provided by hardware. Software may initiate certain power management transitions via writes to the port’s PM D-State register and the TRIO_PCIE_INTFC_PM_INTFC_CTL register.

Active state power management (ASPM) L0s/L1 transitions are handled completely by the hardware, as are the activities associated with D0-3 state transitions.

Transitions into and out of L2/L23 require software interaction via the XMT_TURNOFF, XMT_PME, and LINK_RESTART bits in the TRIO_PCIE_INTFC_PM_INTFC_CTL register.

5.7 Link Down Handling

PCIe ports may go down due to surprise-remove, power management transitions, excessive link errors, or system reset. When this occurs, transactions targeting the port must be flushed out of the system so that transactions targeting other ports remain unaffected.

Read requests from TRIO Pull DMA are automatically “timed out” when their link goes down. PIO requests will time out naturally.
These timeouts will automatically restore the link’s tag space and completion credits.

Similarly, pending write requests from PIO and Push DMA as well as ingress MAP completions will be flushed out of the port to prevent blocking of transactions to other ports.

The link can be brought back up by software or can automatically retrain and be used as normal.

5.8 SERDES Configuration

The SERDES configuration (PLL settings, drive strength, de-emphasis, equalization, etc.) is typically handled by the hardware automatically. However, if customized settings are needed, the TRIO_PCIE_INTFC_SERDES_CONFIG register provides an access mechanism to SERDES-specific settings. The SERDES registers are not documented as part of the TILE-Gx IO Guide.

5.9 Streaming Interface

The TILE-Gx PCIe supports a SERDES-based streaming data interface for transport of bulk data to and from external devices such as FPGAs. When the streaming interface is used, it replaces the associated PCIe MAC with a lightweight datalink layer which provides packetization, lane bonding, symbol encoding, error detection, and flow control without the need for a complete PCI Express feature set (configuration registers, etc.).

The interface supports from 1 to 4 lanes at rates up to 6.25Gbps per lane. An inband flow control mechanism provides simple credit-based flow control of the datalink buffer.

The streaming interface supports all of the data movement models described previously including:

• Push DMA: Moves data from Tile memory to remote device.
• Pull DMA: Moves data from remote device to Tile memory via “reads”.
• MAP MEM: Writes and reads from remote device mapped into Tile memory. Provides support for ordered, unordered (pipelined), and interrupt traffic.
• MAP SQ: Writes and reads from remote device mapped into Tile memory with Tile-side descriptor FIFO specifying the address and doorbell register for interrupts.
• Tile PIO: Loads and stores from Tile mapped to remote device writes and reads.
• Boot/Debug: Accesses to the Rshim from the remote device for boot and debug via a dedicated address space.

5.9.1 Packetization

The streaming data is packetized into fragments from 8 bytes to 1KB. This allows insertion of credit and clock compensation frames at regular intervals and retains compatibility with the PCI Express DMA infrastructure, which has a 1KB max payload size.

Each packet contains a 64-bit address. This represents the I/O address for push and pull DMA transactions. For streaming data sent from the remote component, the address represents the buffer location for the data. To implement a ring buffer, the address would be incremented by 1024 on each packet and wrapped back to zero at the ring boundary. Double buffer schemes would be similarly implemented by the remote device by keeping track of the buffer and offset being written.

5.9.2 Interrupts

Interrupts can be delivered from the remote device via the streaming interface by using one of the MAP SQ doorbells or by using a MAP MEM region in MSI mode.

5.9.3 Flow Control

The streaming interface's link layer provides flow control for a small FIFO between the transaction and link layers. This flow control allows lossless transfer of data regardless of the resources or clock rates provided on the remote device. The FIFO is typically sized just to cover the bandwidth-delay product of credit updates, accounting for the packet fragmentation size being used. Typically this buffer will be between 1KB and 4KB to maintain line rate of the interface.

CHAPTER 6 mPIPE ARCHITECTURE

6.1 Overview

The TILE-Gx™ Multicore Programmable Intelligent Packet Engine (mPIPE™) provides line rate services for the packet interfaces.
These services are also available for packet data stored in on-chip buffers such as flows being moved to and from a PCI interface. Some example services supported by the mPIPE are:

• Parse: Identify the flow of each incoming packet
• Packet Distribution: Make ingress packet data available for worker processing
• Buffer Management: Track and distribute packet buffer resources
• Load Balance: Spread work across multiple Tiles
• Checksum: Calculate the L4 checksum on ingress and egress traffic
• Gather: Collect packet data from Tiles, potentially scattered across multiple buffers
• Egress: Send packet data to the wire

Functionality of the mPIPE is divided between ingress and egress, as described in the following sections.

6.1.1 Glossary

The following terms are used throughout this chapter:

• MAC: A physical device connected to the mPIPE. Can include more than one channel.
• Port: The same as MAC.
• Channel: A set of physical resources in the mPIPE. Each MAC is associated with one or more channels. Depending on system configuration, more than one MAC might share a channel, for example when two MACs share a set of input pins and hence cannot be in service simultaneously.
• eDMA Ring: An egress descriptor ring. Each ring is associated with exactly one channel and hence exactly one MAC/Port.
• Priority Queue: A set of virtual resources in the mPIPE. Ingress packets are assigned to a priority queue by the MAC. Egress packets are queued based on the configuration of their associated eDMA ring.
• NotifRing: Data structure stored in Tile memory containing ingress packet descriptors created by the classifier.
• Classifier: Processor that parses incoming packets and determines to which flow the packet belongs.
• Load Balancer: Assigns incoming packets to NotifRings.
For more terminology refer to “Glossary, Conventions and Standards” on page 585.

6.1.2 PHY and DMA Sharing

The TILE-Gx™ processor shares its SERDES lanes between many different interfaces. The specific interfaces vary depending on the device within the TILE-Gx processor family.

Implementation Note: The TILE-Gx36™ device supports the following interfaces connected to the mPIPE:

• Four XAUI
• Sixteen Gigabit Ethernet (CDR-based SGMII)

Since these interfaces share SERDES lanes, not all configuration cross products are possible. Similarly, a common mPIPE is shared between the interfaces described above. This sharing of PHYs and packet processing services is shown in Figure 6-1.

[Figure 6-1: PHY/DMA Sharing. SERDES lanes feed a PHY distribution layer, then the XAUI, GbE, and Interlaken MACs, a MAC distribution layer, the channelized iDMA and eDMA of the mPIPE, and the Tiles.]

6.1.3 Channelization

The mPIPE must manage traffic across multiple interfaces or multiple flows within the same interface (Interlaken). The mPIPE provides independent resources for each channel to support QoS and non-blocking flows.

Implementation Note: The TILE-Gx36 device provides 20 channels mapped to up to 16 active MACs and 4 loopback channels.

6.1.4 Channels vs. Ports

There is a distinction between physical port and channel. A given physical port (such as SGMII and XAUI) is statically assigned to a channel. In the case of multi-channel interfaces such as Interlaken, the physical port is statically assigned to multiple channels.

Note that a channel might be assigned to multiple ports. This can happen when two ports cannot be physically active at the same time in a system.
For example, they might share a PHY or other physical resource.

The MPIPE_MAC_MAP registers (MPIPE_MAC0_MAP, as an example) specify which channels are associated with each port (MAC). The MPIPE_MAC_ENABLE register defines which ports are enabled.

On ingress, the classification and load balancing steps described in detail later in this document define how packets arriving on a channel get distributed to various notification rings (workers).

For egress, the eDMA rings described in detail later in this document each have a configurable output channel. Thus an eDMA ring is associated with a single output channel and a single output port.

6.1.5 Priority Queues

To support priority queuing standards such as 802.1Qbb, the mPIPE provides virtual queues for both ingress and egress. Each MAC (port) determines how its traffic is mapped into the mPIPE’s priority queues. Egress traffic is assigned to queues based on the PRIORITY_QUEUES bit mask in the MPIPE_EDMA_RG_INIT_DAT_MAP register.

Flow control for ingress traffic is provided on a per-queue basis, and each MAC responds to the flow control based on the MAC’s configuration (for example, pause frames, priority pause, or other inband or out-of-band flow control).

Implementation Note: The TILE-Gx36 device supports 16 priority queues.

6.1.6 Communication Model

The TILE-Gx processor memory system is optimized for efficient communication between Tiles, cache controllers, memory controllers, and I/O devices. Bulk data transfer is done via the memory system. The mPIPE supports caching hints to optimize cache and memory-bandwidth utilization based on the system’s locality attributes. Interrupts are delivered through the Interprocessor Interrupt (IPI) mechanism.

6.2 Ingress Services

The TILE-Gx processor’s ingress hardware manages incoming packets from the I/O channels. The ingress mPIPE parses the packets, writes the packet data into software-visible buffers, then load balances across worker Tiles.
6.2.1 Typical Ingress Flow

Packets typically take the following steps through the ingress portion of the mPIPE. Subsequent sections describe the mechanisms used to implement this flow.

[Figure 6-2: Typical iDMA Flow. Stages shown: MAC Distribution, Classification, Load Balance, iDMA, Buffer Manager, and Worker Notification, connected through the iPkt Buffer, DMA commands, and the Descriptor Buffer. The numbered arrows (1, 2a, 2b, 2c, 3, 4 Write Packet Data, 5 Write Descriptor Data, 6 Notify Worker) correspond to the steps below.]

1. A packet is assembled by PHY and MAC layers and presented to channelized iDMA.
2. The packet is classified in order to identify the flow and choose a buffer pool. The classification steps generate a packet descriptor containing the following:
   a. Buffer Pool / DMA control.
   b. Bucket (typically computed from the flowID).
   c. Custom fields (for example: flow hash) passed to software.
3. The load balancer chooses a worker, based on the hashed FlowID.
4. Packet data is written to the chosen buffer pool, typically into the Tiles’ L3 cache.
5. A packet descriptor is written into a ring in memory space, typically local to the worker.
6. The worker is notified that a new packet descriptor is available via interrupt and/or update to the ring’s tail pointer.

6.2.2 Buffers

Packet data is stored in buffers. Each buffer has an associated descriptor that defines the buffer’s attributes (virtual address, size, chaining). The buffer descriptor format is shown in Figure 6-3.

[Figure 6-3: Buffer Descriptor Formats. A single 64-bit word at offset 0x00 containing, from bit 63 down to bit 0: C, Size, HWB, Reserved, StackIDX, Reserved, VA[41:7], Offset.]

Table 6-1. Buffer Descriptor Formats

Bits    Fields     Description
63:62   C          Chaining Designation. Set by iDMA hardware and application SW prior to eDMA.
                     0  Unchained buffer pointer.
                     1  Chained buffer pointer. Next descriptor stored in 1st 8 bytes in buffer.
                     3  Invalid Descriptor. Could not allocate buffer for this stack (iDMA) or end of chain (i/eDMA).
61:59   Size       Size of Buffer. Encoded as follows:
                     0  128 bytes
                     1  256 bytes
                     2  512 bytes
                     3  1024 bytes
                     4  1664 bytes
                     5  4096 bytes
                     6  10368 bytes
                     7  16384 bytes
58      HWB        Hardware Buffer. Indicates that this is a hardware-managed buffer. This bit will always be set for ingress packets. On egress, this bit indicates that the buffer should be returned to the hardware buffer manager.
57:53   Reserved   Reserved
52:48   StackIDX   Buffer stack to which this buffer belongs.
47:42   Reserved   Reserved
41:7    VA         Virtual address bits 41 to 7. Buffers are always aligned to 128 bytes, though packet data might be offset within the buffer.
6:0     Offset     Start byte of data within the 128-byte aligned buffer.

The buffer descriptor provides a VA. The stack to which a buffer belongs provides an address space identifier (ASID) to allow independent address spaces concurrently within the same system. See section “Virtual Memory” on page 133.

MiCA accelerator blocks, as described in “Common Accelerator Interface (MiCA)” on page 175, are also capable of reading/writing buffers that are written/read by mPIPE. However, note that only mPIPE accesses the Buffer Stacks (see Section 6.2.2.1 Buffer Stacks).

6.2.2.1 Buffer Stacks

Buffer descriptors are stored in stacks that are managed by a buffer stack engine in the mPIPE. Each time the iDMA engine requires a new buffer, it pops a descriptor from the associated stack managed by the buffer stack engine. Each time the eDMA engine frees a buffer, the descriptor is pushed back onto the stack by the buffer stack engine. Software can also return buffers to the stack by sending a message to the buffer stack engine.
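The bit layout in Table 6-1 can be made concrete with a short encoder/decoder. This is only a sketch under stated assumptions: the names `BufferDescriptor`, `encode_bdesc`, and `decode_bdesc` are hypothetical and are not taken from the mPIPE header files; the field positions and the size encoding come directly from Table 6-1.

```python
from typing import NamedTuple

# Size field encodings (bits 61:59) from Table 6-1.
SIZE_BYTES = (128, 256, 512, 1024, 1664, 4096, 10368, 16384)
# VA occupies bits 41:7 of the descriptor word.
VA_MASK = ((1 << 35) - 1) << 7


class BufferDescriptor(NamedTuple):
    chain: int        # C field: 0 unchained, 1 chained, 3 invalid
    size_bytes: int   # decoded buffer size in bytes
    hwb: bool         # hardware-managed buffer
    stack_idx: int    # buffer stack index
    va: int           # 128-byte aligned virtual address
    offset: int       # start byte of data within the buffer

    @property
    def data_address(self):
        """Address of the first data byte (VA plus offset)."""
        return self.va + self.offset


def decode_bdesc(word):
    """Split a 64-bit buffer descriptor word into its Table 6-1 fields."""
    return BufferDescriptor(
        chain=(word >> 62) & 0x3,
        size_bytes=SIZE_BYTES[(word >> 59) & 0x7],
        hwb=bool((word >> 58) & 0x1),
        stack_idx=(word >> 48) & 0x1F,
        va=word & VA_MASK,
        offset=word & 0x7F,
    )


def encode_bdesc(chain, size_code, hwb, stack_idx, va, offset):
    """Build a descriptor word; reserved bits are left as zero."""
    if va & ~VA_MASK:
        raise ValueError("VA must be 128-byte aligned and fit in 42 bits")
    return ((chain & 0x3) << 62 | (size_code & 0x7) << 59 | int(hwb) << 58
            | (stack_idx & 0x1F) << 48 | va | (offset & 0x7F))
```

As a usage example, `decode_bdesc(encode_bdesc(1, 0, True, 3, 0x12345680, 20))` yields a chained 128-byte hardware buffer on stack 3 whose `data_address` is `0x12345694`.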
By storing buffer descriptors in stacks, the TILE-Gx processor optimizes temporal locality of packet data. This locality also allows a portion of the stack to be cached by the buffer stack engine, thus reducing the bandwidth required to read buffer descriptors from Tiles and memory.

The mPIPE supports multiple buffer stacks to allow size differentiation such as {small, medium, large, jumbo}. Multiple stacks also allow QoS guarantees for different flows and/or applications and support buffer regioning such that a group of workers assigned to a given application can have their buffers homed on a specific Tile or hashed-for-home within a set of Tiles.

Implementation Note: The TILE-Gx36 device supports 32 buffer stacks.

The buffer stack used for an incoming packet is chosen during classification. Each buffer stack corresponds to an ASID. The TLB entry associated with a buffer’s VA and ASID provides the caching attributes including NonTemporal hint (NT-HINT), Pinning, and homing information. These attributes include:

• Buffer Size: All buffers in a stack must be the same size.
• Background data policy: Specifies whether or not to fetch background data on partial cacheline writes or to zero unused bytes.
• Write Miss Policy: Specifies whether or not to allocate in the cache when a write misses or send to main memory.
• Write Hit Policy: Specifies whether or not to update the temporal hint (LRU) on a write hit.
• Read Inval Policy: Specifies whether or not to invalidate the cacheline after a read to save memory bandwidth (note that subsequent accesses will see unpredictable data).

6.2.2.2 Buffer Chaining

The TILE-Gx processor’s iDMA flow supports automatic scatter via buffer chaining.
When a packet exceeds the size of a single buffer, the packet is fragmented across multiple buffers and a link is created from each buffer to the subsequent buffer. The links are simply buffer descriptors written into the first 8 bytes of the buffer. The first 8 bytes of the final buffer in the chain are reserved/unpredictable.

Each buffer descriptor contains both the buffer’s VA (virtual address) and its offset. While the VA points to the very beginning of the buffer, data will be written into the buffer starting at the offset. Figure 6-4 shows a buffer chaining example where:

• The buffer size is 128 (120 bytes available for data).
• Three 128B buffers are required for a 300-byte packet.
• All buffers except for the first and last are full.
• All buffer descriptors except the first one have an offset of 8.
• The buffer descriptor in the final buffer is marked INVALID. See the exception for cut-through, as described in Section 6.2.2.2 Buffer Chaining and Section 6.2.2.3 Buffer Release.

Note that the Buffer Descriptor shown in Figure 6-4 is associated with iDesc; however, this is not the only place a buffer descriptor can be used. Buffer descriptors can also be part of eDMA Packet Descriptors (refer to Figure 6-17 “eDMA Descriptor Format” on page 124) or can even be used standalone (refer to Figure 11-5 “Using a List of Buffer Descriptors as a MiCA Destination Mode” on page 183).

[Figure 6-4: Buffer Chaining Example. The iDesc (L2_size=300) holds a buffer descriptor with Size=0 (128B), C=1, Offset=20 pointing to Buffer-0 (108 valid bytes). Buffer-0 holds a descriptor with C=1, Offset=8 pointing to Buffer-1 (120 valid bytes). Buffer-1 holds a descriptor with C=1, Offset=8 pointing to Buffer-2 (72 valid bytes). Buffer-2 holds a descriptor with C=3 (INVALID).]

Packets that cut through (as indicated by the CT bit in the packet descriptor) will always use chained buffers, unless the buffer size is set to 7 (16384 bytes).
In this case, the packet will always fit into the buffer or will be truncated by the hardware.

For packets that have been designated to be handled by cut-through methods, the buffer descriptor’s chain-valid indicator cannot be used to determine the end of the buffer chain. Unlike store-and-forward methods, cut-through handling of packets does not need to write to the final buffer (indicating the end of the packet transmission). Instead, software must calculate the number of buffers that were used and follow the chain appropriately.

When buffers are chained, the 7-bit offset field in the buffer descriptor includes the 8-byte buffer chain field. So, for example, if the data starts at byte 23 of the buffer with the chain in bytes 0-7, the offset field would be 23.

The following rules summarize the relationship between buffer chaining and the buffer offset:

• The offset field in the buffer descriptor is in bits[6:0], so the maximum offset is 127.
• If the chain field is BDESC_CHAINED, the minimum offset is 8.
• The classifier does not know if the buffer will be chained, so the offset it picks can have 8 added to it when it gets to the iDMA engine.
• The classifier must never choose an offset greater than 119 if the buffer could possibly be chained.
• The iDMA engine will apply the classifier-specified offset setting only to the first buffer in the chain. Hence all bdescs (buffer descriptors) in the buffer chain, other than the one in the pDesc, will have an offset of 8.
• The eDMA engine expects all bdescs with a chain field of BDESC_CHAINED to have an offset of at least 8.
• The eDMA engine will properly handle descriptors “mid chain” when the buffers are full of data (except for the header), that is, when their offset fields are equal to 8.
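The offset rules above determine how a store-and-forward packet is spread across a chain. The following sketch computes the number of valid data bytes in each buffer; the function names `chain_layout` and `classifier_offset_ok` are hypothetical, and the model assumes that every buffer in the chain reserves its first 8 bytes for the chain descriptor, as the rules state for chained buffers.

```python
CHAIN_FIELD_BYTES = 8   # chain descriptor at the start of each chained buffer
MAX_OFFSET = 127        # offset field is bits[6:0]


def classifier_offset_ok(offset, may_chain):
    """Check a classifier-chosen offset against the limits listed above."""
    limit = MAX_OFFSET - CHAIN_FIELD_BYTES if may_chain else MAX_OFFSET
    return 0 <= offset <= limit


def chain_layout(l2_size, buffer_size, first_offset):
    """Return the valid data byte count of each buffer in a chain.

    first_offset is the offset in the first buffer's descriptor, which
    already includes the 8-byte chain field.
    """
    if l2_size <= 0:
        raise ValueError("packet must contain at least one byte")
    if not CHAIN_FIELD_BYTES <= first_offset <= MAX_OFFSET:
        raise ValueError("chained offsets must be between 8 and 127")
    layout = []
    remaining = l2_size
    offset = first_offset
    while remaining > 0:
        used = min(remaining, buffer_size - offset)
        layout.append(used)
        remaining -= used
        # Only the first buffer uses the classifier-derived offset;
        # every later buffer starts right after its chain field.
        offset = CHAIN_FIELD_BYTES
    return layout
```

For the Figure 6-4 example, `chain_layout(300, 128, 20)` returns `[108, 120, 72]`, matching the valid byte counts of Buffer-0, Buffer-1, and Buffer-2.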
6.2.2.3 Buffer Release

In many systems, the egress hardware can be used to release chained buffers back to the buffer stack manager. In these systems, software simply posts the eDMA packet. The hardware will collect and free the buffers. If software decides to discard the packet, the buffers can still be collected and freed by the egress hardware using the NoSend flow described in “NoSend Option” on page 133.

Note that for chained packets that have cut through, the final buffer descriptor stored in the last buffer in the chain will need to be released explicitly by software since the egress buffer collection hardware will not free this buffer. For more information see “Transaction Sizing and Buffer Offsets” on page 129.

Software Buffer Release

In systems where a hardware buffer release is not desirable or possible, software is required to maintain and/or release the buffers. A generic method for software to determine which buffer descriptors need to be freed from an ingress packet is shown in Listing 6-1.

Listing 6-1. Example Buffer-Release Algorithm

    total_bytes = pkt_descriptor.L2_SIZE +   // packet bytes
                  pkt_descriptor.OFFSET -    // offset (zero pad)
                  8                          // 8 bytes of offset include the chain pointer

    if (!pkt_descriptor.chained)
        num_descriptors = 1
    else
        // number of descriptors required, accounting for
        // chain pointers
        num_descriptors = roundup(total_bytes/(buffer_size-8))

    // extra descriptor for cut through packets
    // Largest buffer size never chains
    // on buffer error, there won’t be an extra buff desc
    if (pkt_descriptor.CT && (buffer_size != 16384) && !pkt_descriptor.BE)
        num_descriptors++

    // get 1st buffer descriptor from packet descriptor
    buf_desc = pkt_descriptor.BUFF_DESC

    while (num_descriptors--)
        // the next descriptor is stored in the first 8 bytes of this buffer
        next_buf_desc = *(buf_desc.VA & ~0x7f)
        free_buf(buf_desc)
        buf_desc = next_buf_desc

Alternatively, in systems where packets might cut through, software could ensure that the first 8 bytes of all buffers are always written with an invalid buffer descriptor prior to freeing any buffers. This would allow software simply to follow the buffer chain linked list and free all valid buffer descriptors, stopping once it reaches an invalid descriptor.

6.2.2.4 Buffer Stack Engine

The buffer stacks are managed in hardware using the Buffer Stack Engine. System software assigns a physical address range for each stack. These ranges must be aligned to 64KB regions in physical address space. The stacks are configured using the MPIPE_BSM_INIT_CTL and MPIPE_BSM_INIT_DAT registers.

The buffer stack engine maintains a prefetch buffer for the top of the stack, allowing fast access by the iDMA engine and reducing Tile memory system bandwidth when a steady state of eDMA buffer returns is able to feed directly into the iDMA engine.

Implementation Note: The TILE-Gx36 device prefetches up to 64 descriptors for each stack.
Typically, buffers are consumed by iDMA hardware and released by eDMA hardware. However, software can manually post buffers via an MMIO write to the buffer stack engine. Software can also consume buffers by directing an MMIO read to the stack engine.

MMIO reads to a buffer stack return a valid buffer indication only if the stack’s prefetch buffer has descriptors available. If there are descriptors in the stack but the prefetch buffer is temporarily empty, the MMIO read returns a descriptor with the chain-mode set to BDESC_NOT_RDY (2). Software can re-read until it gets a valid descriptor. Generally, this condition occurs only if the low-water mark in the MPIPE_BSM_CTL register is set below the recommended value. If there are no descriptors in the stack, an MMIO read returns a descriptor with the chain-mode set to BDESC_INVALID (3).

Hardware spills/fills descriptors to/from the Tile-memory based stack, thus the format of these descriptors is typically not relevant to software. However, software can choose to “preload” the memory-based buffer stacks or otherwise interpret the data. The format for the hardware-managed buffer stacks is provided in Table 6-2.

Table 6-2. Hardware-Managed Buffer Stacks
[Table layout not recoverable from the source. Columns are Byte7 through Byte0; rows are the 8-byte words at offsets 0x00 through 0x38 of a 64-byte block; the cells hold compressed descriptors Desc-0 through Desc-11, some of which span two rows.]

Each Desc is a compressed buffer descriptor with the following format:

    Bits 39:35    Reserved
    Bits 34:0     Buffer VA[41:7]

Figure 6-5: Buffer Stack Manager Memory Format

The format is repeated for each 64-byte block (12 descriptors), as described above. Buffers are stored in blocks of 12 descriptors, so software must take care to add or remove blocks of 12 if it is manipulating the Tile-memory based stacks directly.
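As an illustration of the compressed format in Figure 6-5, the following C sketch packs a buffer virtual address into a 40-bit compressed descriptor and recovers it. The helper names, and the use of a uint64_t to carry the 40-bit value, are assumptions made for illustration and are not part of any mPIPE API. The byte-level placement of the twelve descriptors inside a 64-byte block (Table 6-2) is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 34:0 of a compressed descriptor hold Buffer VA[41:7] (Figure 6-5). */
#define BSM_VA_FIELD_BITS   35u
#define BSM_VA_FIELD_MASK   ((UINT64_C(1) << BSM_VA_FIELD_BITS) - 1)
#define BSM_VA_SHIFT        7u    /* VA[6:0] is not stored */
#define BSM_DESCS_PER_BLOCK 12u   /* descriptors per 64-byte block */

/* Hypothetical helper: compress a VA; reserved bits 39:35 are left zero. */
static inline uint64_t bsm_compress_va(uint64_t va)
{
    return (va >> BSM_VA_SHIFT) & BSM_VA_FIELD_MASK;
}

/* Hypothetical helper: recover VA[41:7]; the low 7 bits come back as zero. */
static inline uint64_t bsm_expand_va(uint64_t desc)
{
    return (desc & BSM_VA_FIELD_MASK) << BSM_VA_SHIFT;
}

/* Hypothetical helper: number of 64-byte blocks needed for a buffer count,
 * since the memory-based stack is manipulated in blocks of 12 descriptors. */
static inline uint32_t bsm_blocks_for(uint32_t num_buffers)
{
    return (num_buffers + BSM_DESCS_PER_BLOCK - 1u) / BSM_DESCS_PER_BLOCK;
}
```

Software that preloads a memory-based stack would round its buffer count to a multiple of 12 (for example with bsm_blocks_for) before writing blocks.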
Typically, software will add and remove buffers via the MMIO buffer post/fetch interface, hence the format and blocking of descriptors is not relevant.

6.2.3 iDMA Packet Descriptors

Each packet traversing the iDMA flow is assigned a packet descriptor. The packet descriptor is a 64-byte summary of the classification, load balancing, and buffer management aspects of ingress processing. The packet descriptor is delivered to the worker via memory space writes to notification rings. The packet descriptor also controls DMA processing and load balancing. Its format is shown in Figure 6-6. Note that this format is described in the mPIPE I/O descriptor header file.

[Figure 6-6: iDMA Packet Descriptor (entry in NotifRing). The figure lays out the 64-byte descriptor as sixteen 32-bit words at offsets 0x00 through 0x3C. Fields shown: Channel, CS, NR, TS, SQ, PS, BE, Dest, ME, TR, CT, CE, L2_Size, CTR0, CSUM_SEED/VAL, NotifRingIDX, BucketID, CSUM_START, CTR1, GP_SQN/GP_SQN_SEL, PacketSQN, TimeStamp, general-use words (custom data from classifier to SW), and the buffer descriptor (C, Size, VA[31:7], StackIDX, Offset, VA[41:32]). Legend: filled by HW; must be filled by classifier; HW or classifier; reserved; custom or HW.]

Table 6-3. iDMA Packet Descriptor Formats

Bits          Description
Channel       Source Channel.
PS            Enable PacketSQN Insertion. When 1, packet sequence number (packetSQN) will be inserted. When 0, packetSQN field can be filled with custom data from classifier.
PacketSQN     Packet Sequence Number. Assigned at notification time.
TimeStamp     Timestamp assigned at arrival from MAC.
CS            Checksum Generation Enabled.
TS            Enable TimeStamp Insertion.
              When 0, the timestamp field can be used for custom data by the classifier. When 1, hardware inserts the timestamp that was captured when the start of packet was received from the MAC.
CSUM_SEED/VAL Initial seed for checksum (from classifier), later filled with CSUM result by HW.
CSUM_Start    Start Byte for Checksum.
L2_Size       Final L2 size of Packet. Written at notification time. Does not include preamble or CRC unless those fields are enabled for pass-thru from MAC.
ME            MAC Error. Generated by the MAC Interface. Asserted if there was an overrun of the MAC's receive FIFO. This condition generally only occurs if the mPIPE clock is running too slowly.
CE            L2 CRC Error. Generated by the MAC. Asserted if MAC indicated an L2 CRC error or other L2 error (bad length, etc.) on the packet.
CT            Cut-through. When asserted, the packet was not completely received before being passed to classifier. The L2_Size field indicates the number of bytes received so far.
TR            Truncate. The packet was truncated due to out-of-space in the iPkt buffer.
BE            Buffer Error. Indicates that the hardware ran out of buffers for this stack. SW must still free any buffers in the chain.
BucketID      BucketID. Filled by classifier.
NR            NotifRingIDX is determined by the classifier instead of the load balancer.
StackIDX      Buffer stack to use for this packet.
NotifRingIDX  NotifRing to write pDesc into. Typically filled by load balancer, but can be overridden by classifier with NR bit.
GP_SQN        Sequence number applied when packet is distributed. Classifier selects which sequence number is to be applied by writing the 13-bit SQN-selector into this field.
SQ            When asserted, the GP_SQN_SEL field contains the sequence number selector and the GP_SQN field will be replaced with the associated sequence number. When clear, the GP_SQN field is left intact and can be used as “Custom” bytes.
Size, VA, C   Filled by the stack manager based on the buffer chosen from StackIDX.
Offset        Start offset within the buffer for the packet data.
Table 6-3. iDMA Packet Descriptor Formats (continued)

Bits          Description
Dest          0: Drop Packet, Drop pDesc.
              1: Write packet to buffer(s) from BufferIDX. Descriptor sent to load balancer for notification.
              2: Write pDesc, drop packet (used for recirculated packets where packet data is already in Tile buffers).
              3: RSVD.
CTR0, CTR1    Encoded Counter Selects. The associated counters are incremented when the packet is sent.

6.2.4 Notification Rings

Packet descriptors are written into rings stored in memory space. These rings represent work queues for individual Tiles. There are more rings than worker Tiles to allow a given worker to have multiple work queues. This provides the capability, for example, to have high priority packets assigned to their own ring.

Implementation Note: The TILE-Gx36 device supports 256 Notification Rings.

Each notification ring is associated with a specific worker Tile. The NotifRingTable in the mPIPE stores the memory location of the ring, the TileID of the Tile assigned to receive notification messages (if enabled), the ring size (126, 510, 2046, or 65534 descriptors), and the current ring count. This table is accessible via the MPIPE_LBL_INIT_CTL/MPIPE_LBL_INIT_DAT registers.

Each ring holds two fewer descriptors than the associated memory footprint. One entry holds the tail-pointer data, and one entry is left unused so that software can distinguish empty versus full by comparing head and tail values. If software requires more than 65534 packets to be enqueued for a given ring, it must copy the 64-byte packet descriptors to a software-managed ring.

The mPIPE writes packet descriptors into the notification ring assigned by the load balancer.
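The ring arithmetic in this section can be sketched in C as follows. Software keeps a head index that starts at 1 and wraps back to 1, because entry 0 of the ring holds the tail pointer. The function names are hypothetical, the ring size is expressed as the number of 64-byte entries in the memory footprint (128, 512, 2048, or 65536), and a plain comparison is used for the wrap instead of a mask-and-shift form; this is a sketch, not the Tilera library implementation.

```c
#include <assert.h>
#include <stdint.h>

#define PDESC_BYTES 64u   /* each ring entry is one 64-byte slot */

/* Usable descriptors: one entry holds the tail pointer and one entry is
 * left unused so that head == tail can only mean "empty". */
static inline uint32_t notif_ring_capacity(uint32_t entries)
{
    return entries - 2u;
}

/* Memory footprint of the ring in bytes. */
static inline uint32_t notif_ring_bytes(uint32_t entries)
{
    return entries * PDESC_BYTES;
}

/* Advance the software head index. Valid positions are 1..entries-1;
 * position 0 is the tail-pointer slot, so the index wraps to 1. */
static inline uint32_t notif_ring_next_head(uint32_t head, uint32_t entries)
{
    head += 1u;
    return (head == entries) ? 1u : head;
}

/* A worker has work pending whenever its head differs from the tail. */
static inline int notif_ring_has_work(uint32_t head, uint32_t tail)
{
    return head != tail;
}
```

A worker would initialize head to 1, poll the tail value stored at the front of the ring, process the descriptor at index head while notif_ring_has_work() is true, and then call notif_ring_next_head().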
After the descriptor is written into the ring, the tail pointer is updated both in the NotifRingTable and in the Tail field stored in the Notification Ring itself. The Tail field is updated with either an 8-byte or 64-byte write operation as specified by the TUP_PTC configuration (tail pointer update) bit of the MPIPE_NTF_CTL register. In all cases, the first 64 bytes of the ring are NOT used for packet descriptors, thus the tail pointer will never be zero.

Software must initialize its head pointer to 1. Each time it processes a descriptor, it should increment the head pointer by 1. Once the head pointer passes the ring size, it must be set back to 1. For example, in a size=2048 ring (up to 2046 descriptors), the increment could be done by:

    head = head + 1;
    head = (head & 0x7ff) + (head >> 11);

The worker Tile can poll on the tail pointer stored in front of the NotifRing to see when a new packet becomes available (that is, when head != tail). Optionally, an interrupt can be sent to the worker when the tail has been updated.

The tail pointer stored in the NotificationRing represents the next ring location to be written. Hence, the first descriptor written into the ring will result in the Tail value being set to 2, indicating that the descriptor in location 1 is valid (location 0 contains the tail pointer and padding to the next 64 bytes).

[Figure 6-7: Notification Ring Data Structure Format. The figure shows multiple NotifRings stored in memory space. Each ring entry is a 64-byte packet descriptor containing an 8-byte buffer descriptor that points to the packet data. The Tail pointer is stored in the ring and also in the NotifRing Table; the Head is updated by a message from the worker and stored in the NotifRing Table.]

6.2.5 Store-and-Forward vs.
Cut-Through

Ingress store-and-forward operations allow the packet distribution hardware to make buffer and load balance decisions after the complete packet has been received. This provides, for example, the true L2 size and CRC validation to the classification stage. For high bandwidth flows, the increase in latency due to store-and-forward is generally not significant. However, in a multi-channel configuration, the additional storage and data-copy required for large frames can make store-and-forward impractical due to area, power, and bandwidth constraints. Implementations might require the use of cut-through operations in certain configurations.

Implementation Note: The TILE-Gx36 device provides 192KB of ingress buffering for store-and-forward flows. This allows store-and-forward for up to 4 channels if jumbo frames are supported on the interfaces. It also allows store-and-forward for up to 24 channels of non-jumbo frames.

In configurations where cut-through must be used, the interface will still store-and-forward frames up to the threshold in the CUTTHROUGH field of the MPIPE_IPKT_THRESH register. This allows, for example, the classifier to know the exact L2 size for packets up to 1600 bytes. Beyond this point, the classifier will only know that the packet is “larger than 1600”.

6.2.6 Classifier

In order to provide flexible parsing and compatibility with future protocols, the mPIPE uses a programmable classifier. The classifier’s job is to generate a packet descriptor based on the incoming packet headers. This packet descriptor identifies the flow for load balancing and ordering, controls where the data is written, and identifies exception flows. The classifier parses all of the L2 and some or all of the L3 headers in order to identify, for example, the IP source and destination and determine the L4 octet offset.
6.2.6.1 Parallel Processing

The classification step requires significant processing of the packet header. In order to provide a flexible and scalable solution, parallel classification engines are used. Since the classification step is stateless, there is no communication required between the engines and the performance scales linearly with the addition of more classification engines. Hardware in the mPIPE ensures that the packet order is maintained through classification, hence the implementation of many parallel classifiers appears to the user as a single high-speed classification processor.

Implementation Note: The TILE-Gx36 device contains 10 classifiers, each running at up to 1.6GHz. This provides up to 268 cycles to process each packet at the maximum arrival rate of 60 million packets per second (40Gbps at minimum Ethernet packet size). For more information, refer to the TILE-Gx36 Preliminary Data Sheet (DS400).

6.2.6.2 Cycle Budget

Each packet has a classification cycle budget based on the wire time of the packet. The classifier’s distribution and reorder architecture allows longer packets to take extra time for classification. The BUDGET settings in the MPIPE_CLS_CTL register are used to determine the budget for each packet according to the following formula:

    CycleBudget = (min(max(L2_SIZE, BUDGET_MIN) + BUDGET_OVHD, 255) * BUDGET_MULT / 128) + BUDGET_ADJ

BUDGET_MULT is typically calculated at system configuration time based on the following formula:

    BUDGET_MULT = 128 * num_classifiers * freq / line_rate

Where line_rate is in megabytes per second and freq is the frequency of the classifier in MHz. The BUDGET_ADJ setting compensates for integer truncation during the BUDGET_MULT and CycleBudget calculations and also for the budget-expired exception time (3 cycles).
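The two formulas above can be worked through with a short C sketch. The function names are hypothetical, and the example settings used in the usage note (BUDGET_MIN = 64, BUDGET_OVHD = 20, BUDGET_ADJ = 0) are assumptions chosen for illustration, not documented defaults. The sketch also assumes that the multiplication is performed before the division by 128 and that both divisions truncate.

```c
#include <assert.h>
#include <stdint.h>

static inline uint32_t u32_min(uint32_t a, uint32_t b) { return a < b ? a : b; }
static inline uint32_t u32_max(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* BUDGET_MULT = 128 * num_classifiers * freq / line_rate,
 * with freq in MHz and line_rate in megabytes per second. */
static inline uint32_t budget_mult(uint32_t num_classifiers,
                                   uint32_t freq_mhz,
                                   uint32_t line_rate_mbytes_per_s)
{
    return (uint32_t)((UINT64_C(128) * num_classifiers * freq_mhz)
                      / line_rate_mbytes_per_s);
}

/* CycleBudget = min(max(L2_SIZE, BUDGET_MIN) + BUDGET_OVHD, 255)
 *               * BUDGET_MULT / 128 + BUDGET_ADJ */
static inline int32_t cycle_budget(uint32_t l2_size,
                                   uint32_t budget_min,
                                   uint32_t budget_ovhd,
                                   uint32_t mult,
                                   int32_t budget_adj)
{
    uint32_t bytes = u32_min(u32_max(l2_size, budget_min) + budget_ovhd, 255u);
    return (int32_t)((bytes * mult) / 128u) + budget_adj;
}
```

For 10 classifiers at 1600 MHz and a 40Gbps line rate (5000 megabytes per second), budget_mult() yields 409 after truncation. With the assumed settings, a 64-byte packet then receives a budget of 268 cycles, which is consistent with the figure quoted in the Implementation Note in Section 6.2.6.1. Packets whose adjusted size exceeds 255 bytes all receive the same capped budget.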
If line rate is required even when budget exceptions occur, BUDGET_ADJ must be -3 or smaller.

Once a header has exceeded its cycle budget, the classifier can terminate processing on the header and jump back to PC zero (program counter = 0) in order to start working on the next packet. When this occurs, the packet descriptor will be assigned a fixed DEST, NotifRing, and BufferStack based on the configuration in the MPIPE_CLS_CTL register. This can be used to debug the classification program and is not generally intended for use in normal operation.

The setting of the CLSQ_HWM field in the MPIPE_IPKT_THRESH register determines when the cycle budget will be monitored. This allows some margin for occasionally exceeding the budget as long as the classification queue is not filling up.

Using the cycle budget to drop the current header allows the classifier system to drop a packet that is taking too long, rather than dropping unrelated packets that have accumulated behind it. For example, a management flow can be guaranteed to be handled by the classification program and will never be victimized by a flow that is causing the classifier to exceed its cycle budget.

The cycles remaining in the budget and the state of the high water mark are readable by the classifier program through the CLASSIFIER_BUDGET SPR, which allows programs to make dynamic decisions based on time remaining.

6.2.7 Processor Architecture

The classifier, illustrated in Figure 6-8, is a compact 16-bit RISC processor with several special instructions and datapaths to accelerate packet header parsing. Due to its simplified design, the classifier has very few data-induced stalls. As a result, the performance is easy to predict and analyze.
Once the worst case path through the instruction flow is determined, the cycle count can be calculated or simulated to determine the maximum packet rate that can be supported.

[Figure 6-8: Classifier Architecture (see footnote 1). Block diagram showing the instruction memory and fetch PC (with CBR mispredict, predicted-taken, exception, and jr inputs), instruction decode, the iHdr buffer fed from iPkt and addressed by hPtr (16-bit header bytes, 2 ports), the 4KB table addressed by tPtr, the register file (23 x 16bit), the hash accumulators, the SPRs, the srcA/srcB operand paths with bypass and immediate (srcB only), the ALU and HSH units, and the 64-byte pDesc buffer addressed by pdPtr.]

1. The blocks with plus signs (+) indicate incrementers.

Headers from incoming packets are written into the 256-byte iHdr buffer. This buffer is directly addressable from the program to provide zero-latency access to header bytes. The output from the processor is a packet descriptor containing the buffer pool index, DMA control, and custom data. The processor utilizes a 16-bit datapath for maximum efficiency in L2/3 processing.

In addition to typical ALU operations, the processor supports several special purpose functional units.

6.2.7.1 Header and Descriptor

The classifier interacts with the rest of the mPIPE ingress hardware via the packet header and packet descriptor buffers. Instructions can directly consume bytes from the header and directly write bytes to the descriptor by using dedicated register specifiers.

The header buffer is read-only and contains the first 256 bytes of the packet. The buffer is addressed with the hPtr and the bytes can be consumed by any instruction. The hPtr can optionally be post-incremented whenever bytes are consumed by the program. The program decides which bytes of the header to use for classification. The pointer can also be explicitly written in order to provide random access to the header buffer.

The descriptor buffer is a write-only 64-byte structure addressed by the pdPtr.
The program can write 1 or 2 bytes to the structure with any operation. Similar to the hPtr, the pdPtr can be explicitly written to provide random access. pdPtr is not readable from the classifier program.

Both the hPtr and pdPtr are set back to their initial values whenever the PC transitions to zero due to a branch, jump, or exception.

6.2.7.2 Table Lookup

A 4096-byte read-only memory provides lookup capabilities for MAC address matching, VLAN information, or policy decisions. The table is written by Tile software during initialization. The classifier program accesses this table via an indirect mechanism much like iHdr. The tPtr register (R24/mempos) can be written by any instruction and provides a byte address into the table. Instructions that read R24/mem2 will retrieve two bytes at the 2-byte aligned tPtr address and tPtr will be incremented by 2 (the LSB of the address is ignored). Instructions that read R23/mem1 will retrieve one byte from the table and tPtr will be incremented by 1. tPtr is not readable from the classifier program.

6.2.7.3 Special Registers

The classifier uses the instruction’s register specifiers to encode special access to the iHdr buffer and pointer, the pDesc buffer and pointer, and the table and table-pointer. The register encodings are defined in Table 6-4.

Table 6-4. Classifier Register Specifier Encodings

Register   Behavior as Source              Behavior as Destination
R0-R21     GPRs                            GPRs
R22        Read hPtr                       Write hPtr
R23        Read table[tPtr++(1)] (mem1)    Write pdPtr
R24        Read table[tPtr++(2)] (mem2)    Write tPtr
R25        Read iHdr[hPtr] (peek2)         Write pDesc[pdPtr++(1)] (put1)
R26        Read iHdr[hPtr++(2)] (get2)     Write pDesc[pdPtr++(2)] (put2)
R27        HASH0_LO
R28        HASH0_HI
R29        HASH1_LO

Table 6-4.
Classifier Register Specifier Encodings (continued)

Register   Behavior as Source              Behavior as Destination
R30        HASH1_HI
R31        Zero                            Null

6.2.7.4 Hash Accumulator

Two 32-bit hash accumulators provide symmetric hashing of the flowID/Tuple. Either one of the hash accumulators can be run in parallel with a checksum operation on the same source bytes.

Each hash accumulator is a pair of 16-bit GPRs. The hash instructions always accumulate into the same pair of GPRs, thus an accumulator destination operand is not needed. Both an 8-bit and a 16-bit accumulate instruction are supported. The hash function is a CRC using the same 2^32 polynomial as the crc32 instruction provided in the Tile. This is the same polynomial used for Ethernet CRC.

The instruction set for the classification processor is defined in Appendix B: Classifier Instructions and SPRs.

6.2.7.5 Endianness

When reading bytes from the iHdr or the Table, the bytes are interpreted as being big endian. The byte pointed to by hPtr/tPtr feeds bits[15:8] of the source operand and the byte pointed to by (hPtr+1) feeds bits[7:0] of the source operand. This allows multi-byte fields such as EtherType/Len to be properly interpreted by the classifier.

When writing two bytes into the pDesc, bits[7:0] are written into the byte pointed to by pdPtr. Bits[15:8] are written into the byte pointed to by (pdPtr+1). Hence, to copy bytes from the iHdr buffer to the pDesc buffer, the program needs to swap the bytes using an instruction or sequence similar to the one below.

    // copy two bytes from iHdr to pDesc maintaining byte order
    rotli put2, get2, 8;

The SEED field in the pDesc is interpreted in network order by the iDMA checksum calculator. This means that bits[15:8] correspond to the CSUM_START byte of the packet and bits[7:0] correspond to the CSUM_START+1 byte of the packet. The resulting checksum is placed into the pDesc with the byte corresponding to earlier checksummed packet bytes in bits[7:0] and later checksummed bytes in bits[15:8].
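The byte-order rules for get2 and put2 can be modeled on a host with a few lines of C. The sketch below is only a software model for reasoning about byte order; the model_ function names are hypothetical and the code is not classifier code or part of any Tilera tool.

```c
#include <assert.h>
#include <stdint.h>

/* Model of get2: big-endian read. The byte at hPtr feeds bits[15:8],
 * the byte at hPtr+1 feeds bits[7:0]; hPtr is post-incremented by 2. */
static inline uint16_t model_get2(const uint8_t *ihdr, uint32_t *hptr)
{
    uint16_t v = (uint16_t)((ihdr[*hptr] << 8) | ihdr[*hptr + 1u]);
    *hptr += 2u;
    return v;
}

/* Model of put2: bits[7:0] go to the byte at pdPtr, bits[15:8] go to the
 * byte at pdPtr+1; pdPtr is post-incremented by 2. */
static inline void model_put2(uint8_t *pdesc, uint32_t *pdptr, uint16_t v)
{
    pdesc[*pdptr] = (uint8_t)(v & 0xffu);
    pdesc[*pdptr + 1u] = (uint8_t)(v >> 8);
    *pdptr += 2u;
}

/* 16-bit rotate left, standing in for rotli on the 16-bit datapath. */
static inline uint16_t model_rotl16(uint16_t v, unsigned n)
{
    n &= 15u;
    return (uint16_t)((v << n) | (v >> ((16u - n) & 15u)));
}
```

Copying the header bytes 0x86 0xDD with model_put2(pdesc, &p, model_rotl16(model_get2(ihdr, &h), 8)) leaves 0x86 0xDD in the descriptor in the same order. Omitting the rotate would store them as 0xDD 0x86.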
6.2.7.6 Header/Descriptor Valid Indicators

Hardware in the ingress mPIPE writes the header data into the iHdr buffer then sets an internal valid bit indicating that the classification program can consume the packet header. When the PC is set to 0 due to a branch, jump, or exception to PC zero, the classifier automatically clears its internal header-valid indicator. Similarly, hardware reads the packet descriptor when the PC is set to 0 due to a branch or jump instruction or an exception directed to PC zero. A double buffering scheme is used on both the iHdr and pDesc structures to allow pipelined operation of the classifier processor array.

All classifier programs MUST have at least one iHdr read or an MFSPR from one of HEADER_FLAGS, CHANNEL, or L2_SIZE. Without accessing one of these structures, the program will not properly interlock with incoming packet headers and iPkt buffer corruption will occur.

6.2.7.7 Classifier Pipeline

The classifier’s pipeline is bypassed to provide single cycle latency of most operations. Branches and jumps incur extra latency to update the PC. This is visible to the programmer as additional cycles to execute the program; there are no delay slots.

6.2.7.8 Stalls

When the program reads the iHdr buffer and data is not valid, the consuming instruction will stall until data becomes valid. When stalled, the processor is in a low power state. Hence the program simply needs to read the iHdr buffer to automatically wait for the next packet header to arrive.

When the hPtr is explicitly written by an instruction, the iHdr buffer will become invalid for 3-4 cycles while the iHdr buffer is read at the new location.
The program can execute any instructions within these 3-4 cycles; however, if an instruction in this window reads the iHdr buffer, it will be stalled until valid iHdr data is provided.

The classifier also stalls when the packet descriptor buffer has not yet been drained by the classifier’s “join” function and a new header has arrived. Double buffering is provided on both the iHdr and descriptor buffers, and stalls due to descriptor-buffer-full do not occur during normal operation.

The stall conditions are summarized in Table 6-5.

Table 6-5. Classifier Stall Conditions

Stall Condition                           Instructions Affected                        Cycles
Write to hPtr (RAW)                       Consumers of hPtr (getpos)                   1
                                          Consumers of iHdr (get2/peek2)               4 if new hPtr%8=7; 3 otherwise
Write to tPtr (RAW)                       Consumers of the table (mem1/mem2)           3
Write to SPR (RAW)                        MFSPR                                        1
Write to hash accumulator (RAW)           Any instruction that reads r27-r30 via       1
                                          SrcA or SrcB, if the previous instruction
                                          updated the associated accumulator
pDesc buffer full (iDMA stall)            All                                          Stalls until space available
iHdr not valid (waiting for new packet)   Consumers of iHdr (get2/peek2); MFSPR        Stalls until new iHdr arrives
                                          from HEADER_FLAGS, CHANNEL, L2_SIZE SPRs

Conditional branches provide a static prediction hint to optimize the common or critical case for the branch. The pipeline latencies for the various PC updates are summarized in Table 6-6.

Updates to the header pointer other than the built-in post-increment incur a three- or four-cycle delay (the delay will be four cycles if the new hPtr%8 = 7). During the update window, operations that consume the iHdr data will stall. Instructions that do NOT consume iHdr will not stall, so better performance will be achieved if non-iHdr instructions can be inserted after an hPtr update.

Table 6-6.
Classifier Instruction Latencies

Instruction Type       Prediction   Actual      Instruction Latency
Conditional Branch     Not Taken    Not Taken   1 cycle
                       Not Taken    Taken       3
                       Taken        Not Taken   3
                       Taken        Taken       2
Jump-Register                                   3
Exception                                       2
All Other Operations                            1

[Figure 6-9: iHeader Pointer Update Latency. Pipeline timing diagram of an hWin update: an "add hPtr,r1,r2" instruction is followed by unrelated operations issued during update cycles 0 through 2 (plus one potential additional update cycle), and then by an iHdr consumer, "add r5,*hPtr++,r6", which executes once the buffer is valid.]

6.2.7.9 Persistent State

Although the classification processor is intended for stateless packet processing, some state is retained from packet to packet. Much of the program-visible state is reset however. Table 6-7 summarizes the architectural state of the classifier.

Table 6-7. Classifier State

Item                Classifier Access   Tile SW Access            Action when PC Transitions to Zero
GPRs                Read/Write          Write Only                State left intact
Hash Accumulators   Read/Write          None                      State left intact
Lookup table        Read Only           Write Only                State left intact
SPRs                Varies - see SPR descriptions
pDesc Pointer       Write Only          Write Only (init value)   Set to initial value
iHdr Pointer        Read/Write          Write Only (init value)   Set to initial value
tPtr                Write Only          None                      Set to 0
pDesc               Write Only          None                      All bytes set to zero
iHdr                Read Only           None                      Updated to next header

6.2.7.10 Exceptions

The classifier supports a single exception flow for the case where the program attempts to consume header bytes from beyond the end of the iHdr buffer or beyond the L2 size of the packet. This can occur, for example, when the classification program encounters an arbitrarily deep encapsulation of protocols which exceed the reach of the iHdr window.

The exception handling is provided through a programmable exception target PC.
When the classifier generates an exception, the classifier’s PC is directed to the programmed exception handler location. Handling might simply consist of interrupting the Tile and freezing the classifier, or the handler can execute special case instructions, branch back to PC=0, and continue processing the next packet. The exception target PC can also be set to 0, which causes the current header and descriptor to be considered complete upon exception with no further exception processing needed.

The iHdr pointer is considered invalid if it points to the last byte of the iHdr window (0xff) or the last byte of the packet, since only one of the two bytes would be valid. If classifier software needs to access the very last byte, it must set the iHdr pointer to L2_Size-2 or 254 (whichever is smaller).

6.2.7.11 Classifier Configuration

At power on, the classifier’s enable bit is cleared, which freezes the classifier at PC=0. The instruction memory, exception PC, initial hPtr, initial pdPtr, and Table contents of the classifier are loaded by Tile software as part of mPIPE initialization. Additionally, the GPRs can be preloaded to provide constants for the classification program. Once the program has been loaded, Tile software sets the enable bit and the classifier will begin processing incoming packets.

The classifier instruction-memory, lookup-table, and GPRs are configured by writing to the MPIPE_CLS_INIT_CTL and MPIPE_CLS_INIT_WDAT registers. Any or all of the classifiers can be initialized simultaneously.

The initial pdPtr and hPtr are configured by writing to the associated GPR specifiers via MPIPE_CLS_INIT_CTL/MPIPE_CLS_INIT_WDAT. The exception PC is configured by initializing with GPR=R24(mempos).

Runtime changes to the classifier’s program can be made by disabling an individual classifier and reloading program data.
If the performance from the remaining classifiers is insufficient to process the incoming traffic flow, packets will be dropped while the classifier is being reprogrammed.

Tile software cannot directly read the classifier’s SPRs and GPRs. But the classifier can be disabled and a program loaded that will expose the architectural state via the classifier’s PASS SPR, which is visible to mPIPE configuration space.

The classifier must be disabled to update the instruction memory or lookup table. However the GPRs can be updated on an active classifier. This provides a means of communication between Tile software and the classifier.

6.2.7.12 Classifier “Blast” Re/Programming

One or more classifiers can be reprogrammed using the “blast” programmer which provides deterministic downtime and programming time for performing program updates on the fly. A single classifier image is stored in the programmer. The image consists of state updates for the instruction memory, table, and registers. Using the programmer, a full or partial state update to the classifiers can be performed.

When the FLASH bit of the MPIPE_CLS_ENABLE register is written for one or more classifiers, the associated classifiers will stop accepting new packets. Once all associated classifiers have finished their current packet, they will be reprogrammed and re-enabled.

The programmer image is writable via the MPIPE_CLS_INIT_CTL/MPIPE_CLS_INIT_WDAT registers. When triggered by the FLASH bit in the MPIPE_CLS_ENABLE register, the programmer begins reading at programmer-table entry 0. The programmer table is initially configured via MMIO stores and contains records consisting of a command followed by 1 or more data words. The final record in the table is a NULL command. The record format is described in Figure 6-10.
[Figure 6-10: Classifier “Blast” Programmer Record Format. A 32-bit command word (bits 31:0) containing the Sel, Reserved, Data Count, and Start Index fields, followed by the Record Data (Data Count Words). The exact bit positions are not recoverable from the source.]

Table 6-8. Buffer Descriptor Formats

Bits          Description
Data Count    Number of instructions, table entries, or registers to be programmed.
Start Index   First index of the structure to be programmed.
Sel           0: Instructions
              1: Table Entries
              2: GPRs/hPtr/pdPtr/exc_pc
              3: EndOfRecords

A programmer-table setup to program the entire classifier would look like Figure 6-11.

Entry        Contents
0            Command: Sel = 0, Data Count = 1024, Start Index = 0
1-1024       Instruction 0, Instruction 1, …, Instruction 1023
1025         Command: Sel = 1, Data Count = 0 (encodes 2048), Start Index = 0
1026-2049    Upper half: table[1], table[3], …, table[2047]; lower half: table[0], table[2], …, table[2046]
2050         Command: Sel = 2, Data Count = 25, Start Index = 0
2051-2063    Upper half: GPR[1], GPR[3], …, GPR[21], pdPtr, unused; lower half: GPR[0], GPR[2], …, GPR[20], hPtr, exc_pc

Figure 6-11: Classifier “Blast” Example

Note that each record does not need to be the maximum size for that structure. For example, if only 500 instructions are needed, then a smaller iMem record can be used. This will decrease the programming time.

The initialization of hPtr (getpos) and pdPtr (putpos) sets the default value for those pointers at the start of each packet. The initialization of tPtr (mempos) sets the exception PC.

The records can be in any order and record types can be repeated. If multiple records write the same structure and location, the later record entry will overwrite prior entries.

The programmer table contains 2064 entries to allow all classifier states to be programmed if needed. Note that if all 2064 entries are written, the end-of-records indicator is not needed. The programming time is approximately 2.6 microseconds if all classifier state is being programmed.
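The entry counts in Figure 6-11 can be checked with a small C sketch that computes how many programmer-table entries each record occupies. The enum and function names are hypothetical. The packing assumed here follows Figure 6-11: one instruction per data word, and two table entries or two registers per data word. Treating an EndOfRecords command as a single entry is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Sel values from Table 6-8. */
enum blast_sel {
    BLAST_SEL_INSTRUCTIONS = 0,
    BLAST_SEL_TABLE        = 1,
    BLAST_SEL_REGS         = 2,   /* GPRs/hPtr/pdPtr/exc_pc */
    BLAST_SEL_END          = 3    /* EndOfRecords */
};

/* Entries used by one record: one command word plus its data words.
 * 'count' is the number of instructions, table entries, or registers. */
static inline uint32_t blast_record_entries(enum blast_sel sel, uint32_t count)
{
    switch (sel) {
    case BLAST_SEL_INSTRUCTIONS:
        return 1u + count;                /* one instruction per word */
    case BLAST_SEL_TABLE:
    case BLAST_SEL_REGS:
        return 1u + (count + 1u) / 2u;    /* two 16-bit items per word */
    default:
        return 1u;                        /* command word only (assumed) */
    }
}
```

A full image (1024 instructions, 2048 table entries, and 25 registers) comes to 1025 + 1025 + 14 = 2064 entries, which matches the size of the programmer table. A 500-instruction record would occupy only 501 entries.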
6.2.7.13 SPRs

The classifier provides special purpose registers (SPRs) for access to processor and packet state as well as communication with Tile software. SPRs are accessed via MFSPR and MTSPR instructions. The SPRs for the classification processor are defined in Appendix B: Classifier Instructions and SPRs. The classifier’s PASS SPR can be read from Tile software via the MPIPE_CLS_INIT_CTL/MPIPE_CLS_INIT_WDAT registers.

6.2.7.14 Classifier Tools

Tilera provides tools and a baseline configuration to enable customization of the classification program. The tool set includes a C compiler, assembler, and simulator. For additional information, refer to MDE mPIPE Programmer’s Guide (UG506).

6.2.8 iDMA Engine

The iDMA engine moves data from the iPkt buffer into memory space visible to worker Tiles. The classification stage provides DMA control information including the buffer stack index, L2 padding, and checksum control. The load balancer indicates which notification ring should be written when the DMA transfer has completed. The iDMA engine consumes buffer descriptors from the buffer stack manager as needed and creates the linked-list chains within the buffers for iDMA scatter.

[Figure 6-12: iDMA (see footnote 1). Blocks shown: MAC channel active state (current size/iPkt-block), clsQ (WorkQ for classifier), iPkt packet data and BlkInfo, per-packet info (L2_size, CRC Err), classifier, load balancer, iDMA CmdQ, CSUM, retry path, buffer manager, packet data write, and done notification to the notification manager.]

When cut-through is required, it is possible for the iDMA engine to run out of data on a given flow. Rather than stalling and blocking other flows that could make progress, the iDMA engine recirculates the stalled iDMA command back into the iDMA command queue.
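The recirculation behavior described above can be illustrated with a small queue model. This is a sketch only: the command representation, the `bytes_ready` map, and the function name `run_idma` are hypothetical and do not correspond to mPIPE registers or software interfaces.

```python
from collections import deque


def run_idma(commands, bytes_ready):
    """Process (flow, bytes_needed) commands; recirculate those lacking data.

    bytes_ready maps a flow name to the bytes currently buffered for it.
    Returns the completion order and the commands still waiting for data.
    """
    queue = deque(commands)
    completed = []
    passes_without_progress = 0
    while queue and passes_without_progress < len(queue):
        flow, needed = queue.popleft()
        if bytes_ready.get(flow, 0) >= needed:
            bytes_ready[flow] -= needed
            completed.append(flow)
            passes_without_progress = 0
        else:
            # Not enough data yet: put the command back at the end of the
            # queue instead of blocking the flows behind it.
            queue.append((flow, needed))
            passes_without_progress += 1
    return completed, list(queue)


done, waiting = run_idma(
    [("A", 256), ("B", 128), ("C", 64)],
    {"A": 128, "B": 128, "C": 64},
)
print(done)     # ['B', 'C']
print(waiting)  # [('A', 256)]
```

Flow A is short of data, so its command goes to the back of the queue while flows B and C complete.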
Once a packet has been completely written into Tile-visible memory space, the notification manager is informed that the packet descriptor can now be written into the worker’s notification ring.

6.2.8.1 Temporal Hints for iDMA Writes

The Tile memory system provides a NonTemporal hint mechanism for describing the cache access properties of data. The NonTemporal hint is useful for reducing cache pollution in cases where the application might take a long time to access packet data after it arrives from the mPIPE. For writes, this hint has no effect on the architectural state of the processor. It is used only to improve performance and/or determinism in the system.

The iDMA engine uses the NonTemporal hint bit (NT_HINT) from the associated buffer stack as a hint to a cacheline’s home Tile as to whether or not the data is likely to be accessed by the application prior to being naturally displaced from the cache. Additionally, the TEMPORAL_CNT bit of the MPIPE_IDMA_CTL register can be used to indicate the delineation between temporally local data in the header versus temporally non-local data in the packet body.

1. Note that clsQ indicates size, channel, and handling information.

6.2.9 Load Balancer

The load balancer assigns incoming packets to workers. The classifier creates a hashed flowID that it places into the BucketID of the packet descriptor. The load balancer uses the configuration of the Bucket to determine how to distribute packets.

The load balancer allows all packets from the same bucket to be sent to the same worker Tile. Thus packet descriptors for a given flow will be processed in order without requiring software to maintain ordering and affinity. Override modes allow other distribution schemes.

Implementation Note: The TILE-Gx36 device implements 4160 hash buckets.
This allows a simple 12-bit hash function to map to the low 4K buckets while reserving 64 buckets for dedicated special purpose flows and applications.

6.2.9.1 BucketSTS

The bucket status (BucketSTS) table keeps track of the number of packets in flight on a per-bucket basis. For example, if the classifier determines that a packet ‘P’ belongs in bucket ‘B’ and no packets are queued for processing in bucket ‘B’, then the packet descriptor for ‘P’ can be sent to any eligible worker Tile, so the load balancer chooses the least busy worker ‘W’. However, once ‘P’ has been enqueued for processing at a specific worker, all subsequent packets for bucket ‘B’ must go to worker ‘W’ in order to maintain flow order.

A counter associated with each bucket is stored in the BucketSTS table. When the counter is zero, the load balancer can assign any eligible notification ring (worker). The NotifRing index is then written into the BucketSTS table for the associated bucket. When the counter is nonzero, the current NotifRing index stored in the BucketSTS table will be used rather than picking a new NotifRing.

Each time a packet descriptor is enqueued, the associated bucket’s counter is incremented. When the worker has completed processing a packet, it sends a message to the mPIPE, which decrements the bucket’s counter.

Since a bucket is associated with a single NotifRing at any given time and the bucket counter is large enough to count the maximum number of packets that can fit in a ring, the bucket counter cannot overflow as long as software releases the bucket each time it releases the NotifRing.

Note that in the “order-agnostic” override flow described in section “Load Balance Override Flows” on page 117, it is possible for a bucket to have packets outstanding to multiple notification rings. In this case, the bucket counter can wrap. But since the bucket count is being ignored, this does not impact operation.
And since the counter wraps rather than saturating, it will return to zero once all packets have been processed.

6.2.9.2 Notification Groups

The load balancer provides notification groups to support multiple load balancing domains. This allows groups of Tiles to be associated with specific buckets. Hence, when the classifier maps a packet to a specific bucket, it is also indicating a subset of notification rings that are eligible to receive the packet.

Each bucket in the BucketSTS table contains a NotifRingGroup Index. This index is used to look up a table entry that provides a bit mask of all eligible notification rings for the associated group.

Implementation Note: The TILE-Gx36 device provides 32 notification groups. Each group has a 256-bit vector indicating all notification rings allowed to receive packets for buckets that map to that group.

[Figure 6-13: Load Balancer. Blocks shown: classifier bucket; BucketSTS table (NotifGroup, inflight count, inflight NotifRing); NotifGroup table (eligible NotifRings); pick NotifRing; NotifRing table (WorkerID, NotifRing address/head/tail).]

6.2.9.3 Notification Ring Arbitration

A tiered round-robin arbitration scheme is used to select a specific notification ring from the eligible notification rings indicated by the notification group associated with a specific bucket. The arbitration stage chooses the least full notification ring from among the group of eligible rings. The rings’ fullness is quantized into eight states based on programmable thresholds (the MPIPE_LBL_QUANT_THRESHn registers; MPIPE_LBL_QUANT_THRESH0, for example). This simplifies the arbitration decision while still providing fair load balancing. The load balancer will prefer the least-full notification rings and will choose round-robin between notification rings at the same fullness quantization level. The highest state is “full”.
Once a notification ring has reached the full state, no more packets will be sent to that ring.

The threshold is automatically masked based on the ring size such that only the low N bits are considered when comparing head and tail pointers in a ring of size 2^N. Care must be taken to ensure that the thresholds are ascending when masked based on all active RingSizes in the system. The reset values of the thresholds provide an example of correctly programmed thresholds for all possible ring sizes.

Figure 6-14 shows the default quantization thresholds on a log scale.

[Figure 6-14: Default Quantification Thresholds. Log-scale plot of the number of descriptors versus threshold number (0 to 6) for RingSize128, RingSize512, RingSize2K, and RingSize64K. The plotted values are those listed in Table 6-9.]

Table 6-9. Default Load Balancer Quantization

Quantification Register   RingSize128   RingSize512   RingSize2K   RingSize64K
THR6 (full)               126           510           2046         65534
THR5                      51            179           691          12979
THR4                      21            21            21           2069
THR3                      9             9             9            9
THR2                      4             4             4            4
THR1                      2             2             2            2
THR0                      1             1             1            1

To enhance the fairness of load balancing, the picker chooses round-robin within each notification group. In other words, for all NotifRings in a group at the lowest fullness state, the picker will choose round-robin by remembering which NotifRing was previously chosen for the group. Each group maintains the state of which NotifRing was last chosen and this state is updated only when the picker has been used. Thus a bucket that already has a nonzero count will not update the picker’s round-robin state.
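The masking rule and the least-full, round-robin choice can be sketched in a few lines of Python. This is an illustrative model, not hardware behavior in detail. It assumes that a ring moves up one fullness level for each masked threshold its occupancy has reached, and that round-robin means "the next tied ring index after the previous pick". The function names are hypothetical; the threshold values come from the RingSize64K column of Table 6-9.

```python
# Reset thresholds THR0..THR6 for a 64K-entry ring (Table 6-9).
RESET_THRESHOLDS = [1, 2, 4, 9, 2069, 12979, 65534]


def masked_thresholds(ring_size):
    """Keep only the low N bits of each threshold for a ring of 2**N entries."""
    mask = ring_size - 1
    return [thr & mask for thr in RESET_THRESHOLDS]


def fullness_level(occupancy, thresholds):
    """Quantize occupancy into 0..len(thresholds); the top level means full.

    Assumption: one level per threshold that the occupancy has reached.
    """
    return sum(1 for thr in thresholds if occupancy >= thr)


def pick_ring(occupancy, eligible, thresholds, last_pick):
    """Pick the least-full eligible ring, round-robin among ties.

    occupancy maps ring index -> descriptors queued; last_pick is the ring
    chosen previously for this group. Returns None if every ring is full.
    """
    full = len(thresholds)
    levels = {r: fullness_level(occupancy[r], thresholds) for r in eligible}
    candidates = [r for r in eligible if levels[r] < full]
    if not candidates:
        return None
    best = min(levels[r] for r in candidates)
    tied = sorted(r for r in candidates if levels[r] == best)
    # Round-robin: first tied ring after the previous pick, wrapping around.
    for ring in tied:
        if ring > last_pick:
            return ring
    return tied[0]


print(masked_thresholds(128))   # [1, 2, 4, 9, 21, 51, 126]
print(masked_thresholds(512))   # [1, 2, 4, 9, 21, 179, 510]
print(masked_thresholds(2048))  # [1, 2, 4, 9, 21, 691, 2046]

thr = masked_thresholds(128)
occ = {0: 0, 1: 0, 2: 5, 3: 126}
print(pick_ring(occ, [0, 1, 2, 3], thr, last_pick=0))  # 1
print(pick_ring(occ, [0, 1, 2, 3], thr, last_pick=1))  # 0
print(pick_ring(occ, [3], thr, last_pick=0))           # None
```

Masking the 64K reset values reproduces the RingSize128, RingSize512, and RingSize2K columns of Table 6-9, and each masked list is still ascending, which is the property the text asks software to preserve when reprogramming the thresholds.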
6.2.9.4 Load Balance Override Flows

The load balance function can be overridden in several ways to support multiple application distribution models running simultaneously in the system. Each bucket can be individually configured into one of the following modes at initialization via the MPIPE_LBL_INIT_CTL/MPIPE_LBL_INIT_DAT registers:

• DFA (dynamic flow affinity): The bucket is assigned to the least busy NotifRing in the NotifRingGroup when the counter is zero.
• FIXED: The bucket is statically assigned to a specific NotifRing; hence the NotifRingGroup and picker are not used.
• ALWAYS_PICK: Always select the least busy worker. The application will perform its own locking and ordering on per-flow state as needed. Note that since the bucket can be going to multiple rings, it is possible for the bucket counter to wrap. Although the counter is not used in this case, it will return to zero if software properly releases the bucket for all packets.
• STICKY: Sticky flow affinity. The NotifRing will be assigned using the picker ONLY when the current NotifRing is full. Note that in sticky mode, the initial LBL_INIT_DAT_BSTS_TBL.NOTIFRING setting in this register will be used until that NotifRing becomes full. Software must ensure that the initial NotifRing is valid.

The classifier selects buckets with mode attributes appropriate for the flow being distributed.

The load balancer can also be overridden on a per-packet basis by the classifier by setting the NR bit in the packet descriptor. In this case, the NotifRing is selected by the classifier and the load balancer otherwise acts as if it is in “Static Assignment” mode. The bucket’s counter and current notification ring will not be updated. Packet descriptor based overrides take precedence over the bucket override modes.
When software processes a packet with the NR bit set in the packet descriptor, it must NOT release the bucket since that packet does not increment the bucket count.

Table 6-10 summarizes the various types of load balancer flows:

Table 6-10. Load Balancer Flows

Flow Type | Bucket Count | Bucket Current NR | Picker RR-Arb State | NR-Tail
DFA | Incremented. Packet is dropped if count reaches 64K. | Updated by picker if count was zero | Updated if count was zero | Updated (a)
FIXED | Incremented. Counter wraps if it reaches 64K. (b) | Not Updated | Not Updated | Updated (a)
ALWAYS_PICK | Incremented. Counter wraps if it reaches 64K. (b) | Updated (b) | Updated | Updated (a)
STICKY | Incremented. Counter wraps if it reaches 64K. (b) | Updated if current NR is full. | Updated if NR is updated. | Updated (a)
Descriptor (Classifier) Override | Not Incremented | Not Updated | Not Updated | Updated (a)
Drop due to NR full, DFA mode with bucket full, or classifier-drop | Not Incremented | Not Updated | Not Updated | Not Updated

a. If the packet is dropped after load balancing, for example due to running out of buffers, the NR-Tail and count will NOT be updated.
b. Although this state is updated, it is generally not relevant for this type of override flow.

6.2.10 Checksum

The mPIPE provides dedicated checksum offload hardware typically used for TCP processing. The checksum’s seed and start location are generated by the classifier. The checksum is calculated by performing a 16-bit 1’s complement addition on all of the bytes from the start byte to the end of the packet (inclusive). The result of the checksum calculation is delivered as part of the packet descriptor at notification time.

Packets with bad checksums can be re-distributed to software managed exception queues simply by writing the packet descriptor into a different queue. The packet data does not have to be moved.
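The 16-bit 1’s complement addition described in this section can be sketched as follows. This is a software illustration, not the hardware implementation. It assumes Internet-checksum conventions (big-endian byte pairing, zero padding of an odd trailing byte, end-around carry, as in RFC 1071), and it does not model whether the value delivered in the packet descriptor is the sum or its complement. The function name and parameters are hypothetical.

```python
def ones_complement_sum16(data, start=0, seed=0):
    """16-bit one's complement sum of data[start:], starting from seed.

    Bytes are paired big-endian; an odd trailing byte is padded with zero.
    These conventions are assumptions borrowed from RFC 1071.
    """
    total = seed
    chunk = data[start:]
    if len(chunk) % 2:
        chunk += b"\x00"
    for i in range(0, len(chunk), 2):
        total += (chunk[i] << 8) | chunk[i + 1]
        # End-around carry: fold overflow beyond 16 bits back into the sum.
        total = (total & 0xFFFF) + (total >> 16)
    return total


# Worked example using the byte sequence from RFC 1071.
packet = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
print(hex(ones_complement_sum16(packet)))                   # 0xddf2
# Skipping the first word and seeding with its value gives the same sum.
print(hex(ones_complement_sum16(packet, start=2, seed=1)))  # 0xddf2
```

The second call shows why a classifier-supplied seed and start location are sufficient: a seed can stand in for bytes that are excluded from the summed range.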
The classifier also supports 1’s complement addition in its ALU to provide checksum calculations across L3 headers as part of any IP header validation requirements.

6.2.11 Notification

Once a packet’s data and packet descriptor have been written into Tile-visible memory space, the application is notified that it can begin processing the packet. This notification can be done in one of two ways: by writing a copy of the NotifRing tail pointer for software to poll, or by dispatching an interrupt.

[Figure 6-15: Notification Flow. Blocks shown: classifier and load balancer pDesc into the pDesc buffer; iDMA transfer done (pDesc-handle, BuffDesc, NotifRing); sequence numbers and counters (SQN & CTRs); Notif Gen and NotifQ; NotifRing state (current tail); write pDesc to NotifRing; write NR tail or interrupt; write ACKs and coherence tracker.]

6.2.11.1 Tail Pointer Updates – Polling Model

For high bandwidth applications that want to poll for new packet arrival on one of their notification rings, the mPIPE supports an automatic tail-pointer update as packet descriptors are written into the notification ring. A copy of the ring’s tail pointer is stored in the first 8 bytes of the ring. The next 56 bytes in the ring are reserved, although they can be used by software if the TUP_PTC bit of the MPIPE_NTF_CTL register is clear.

Each time a new packet descriptor is written to the ring and is visible to Tile software, the notification manager performs a memory space write to update the tail. The master copy of the Tail is always maintained in the NotifRingTable at the mPIPE and can be accessed through MMIO to the mPIPE configuration space.

6.2.11.2 Notification Interrupts

Each notification ring has an associated interrupt. This interrupt can be enabled by software to allow the worker tile to receive interrupts on new packet arrival.
The interrupt supports a self-masking mode so that once an interrupt is delivered, subsequent ring updates will not trigger an interrupt unless software clears the interrupt status bit. This provides a Linux “NAPI” style of packet delivery where the application can switch between interrupt and polling driven work queues, depending on the bandwidth requirements.

6.2.11.3 Timestamp and Sequence Number Information

The mPIPE provides timestamp and sequence number information as part of the packet descriptor. Timestamps provide a low-jitter indication of a packet’s arrival time. The timestamp is formatted similarly to Linux’s NTP, using 32 bits of nanoseconds and 32 bits of seconds. The timer that delivers the timestamps can be adjusted/synchronized using Tile software, as needed.

The timestamp is applied when the first byte of the packet is sent from the MAC into the iPkt buffer. The timestamp is automatically corrected by hardware to account for varying latency through the MAC to the mPIPE. Thus jitter is less than 50 ns for ports running at 1000 Mbps or faster, even between ports running at different speeds.

Two sequence numbers are also provided by the mPIPE. The first is a global 48-bit packet sequence number applied and incremented for each packet. This allows software to determine a global order across all packets, regardless of the bucket to which the packet was mapped.

The second sequence number is a configurable 16-bit sequence number. This sequence number is provided from a table indexed by the GP_SQN field generated by the classifier. This allows the classifier to provide independent sequence numbers on many different flows, for example on a channel or VLAN basis.
Implementation Note: The TILE-Gx36 device provides 4160 general purpose sequence numbers, so the classification program can assign one per hash bucket if desired.

Both types of sequence numbers are generated and incremented as packet descriptors are written to ensure that any packets dropped by the classifier program or by the iDMA engine due to buffer starvation do NOT get assigned a sequence number.

6.2.12 Counters

The mPIPE supports 32 general-purpose counters. The packet descriptor generated by the classifier contains two 5-bit encoded values specifying which counters should be incremented when the packet descriptor is written into a notification ring. The counters are each 48 bits wide, saturating, read-to-clear, and writable, and they generate an interrupt on saturation. If both counter-selects are the same, the associated counter will only be incremented once. The interrupt associated with each counter will assert when the counter reaches saturation and on each increment that occurs when the counter is saturated.

Both the counters and the sequence numbers are accessed through the MPIPE_SQN_CTR_CTL and MPIPE_SQN_CTR_DAT registers. The counters and sequence numbers are initialized to zero at reset.

6.2.13 Software Override Flows

The mPIPE is designed to satisfy most system architectures without requiring any software/Tile resources for packet distribution. However, there will be some systems or circumstances within a system that require software to assume the roles of classification, load balancing, or buffer management.

6.2.13.1 Software Classification

If the classifier is unable to determine to which bucket a packet should be assigned, the load balancer will be unable to guarantee ordering with other packets from the same flow. Rather than blocking all flows behind unclassified flows, the unclassified flows can be assigned by the classifier to a bucket and buffer stack reserved for classification processing.
Software is responsible for processing the exception buckets: the packet descriptor generated by the classifier will be presented in the notification ring, and software can decide how to further classify and distribute this packet. A dedicated notification group can be used to provide load balancing of the exception flow across multiple worker Tiles.

The ordering of software-classified packets with respect to hardware-classified packets must be managed by software as required. The timestamp of the packet can be used to insert software-classified packets into the hardware-classified flow.

6.2.13.2 Software Load Balancing

In systems where the hardware load balance architecture is insufficient, software load balancing can be achieved by using a single notification ring and having software copy packet descriptors from the hardware managed notification ring to software managed rings.

Since packet data is stored in the Tile processor’s L3 cache, the packet data itself does not need to be moved. And since the packet descriptors fit within a cacheline and the cache system is optimized to move cache blocks, the expense of copying a cache block is generally not significantly higher than other smaller-grained communication.

6.2.13.3 Software Buffer Management

For systems that require a buffer management scheme that cannot be realized with the mPIPE’s buffer stack mechanism, software must have a way to manage buffers on its own.

To support software buffer management, the mPIPE can be directed to move data into a limited set of buffers on a buffer stack. These buffers could be homed on a specific Tile or set of Tiles. Software running on dedicated Tiles would then copy data from the hardware managed buffers into software managed buffers.
This copy operation must be supported at line rate; hence sufficient Tile resources must be provided to support a line rate L3 to L3 memcopy.

The bandwidth-delay product of the descriptor notification, memcopy, and buffer return operations dictates the size of the buffer pool required for the hardware’s buffer stack. This product will typically be significantly smaller than a single Tile’s L2 cache.

6.3 Ingress Channel Flow Control

For multi-channel implementations, such as Interlaken or 802.1Qbb ports that support flow control, the mPIPE provides per-priority-queue iPkt buffer occupancy counters. The counters increment each time a 128-byte iPkt block is consumed on the associated priority queue. The counter is decremented when the block is freed (DMA operation completed). Each MAC has controls to either select the queue from the packet data or override the queue number. The override can be used on links that are not connected to an 802.1Qbb fabric but still require priority queueing.

When the counter reaches a high water mark as programmed in the MPIPE_PR_PAUSE_THR registers, back pressure is applied to the MAC using the mechanism defined for the particular interface (for example pause frames, priority pause, or other inband flow control).

Implementation Note: The TILE-Gx36 device provides 16 priority queues: one for each 802.1Qbb priority queue, plus eight additional queues so that each GbE port can be mapped to its own queue depending on the system configuration.

6.4 Packet Drops

With a properly configured system, the mPIPE will not drop packets. However, the conditions below will cause packets to be dropped or truncated.

6.4.1 Drop/Truncate: iPkt Full

If the iPkt buffer fills up while a packet is arriving from the MAC Interface, the packet will be truncated.
This will be reflected in the packet’s descriptor. If the truncation occurs prior to any cut-through, the classifier will be informed of the truncation via its HEADER_FLAGS SPR.

If the iPkt buffer is full when a new packet arrives, the packet will be completely dropped and the associated channel’s drop-count incremented.

The following conditions can lead to the iPkt buffer filling up:

• The classification program is not achieving line rate.
• The buffer stack manager is unable to provide buffers fast enough because the LWM bit of the MPIPE_BSM_CTL register is set too low or the stack manager is encountering unexpectedly high read latency.
• There is insufficient mesh bandwidth for packet or descriptor writes.
• There is excessive latency for packet or descriptor write-acks.
• The clock rate is too low for the classifier or mPIPE.

6.4.2 Drop: Classifier Cycle-Budget

If the classifier’s header queue is filling up and header processing has exceeded the cycle budget, the classifier will terminate processing and apply the default NotifRing and Dest. The default Dest can be DROP; hence this can lead to dropping the packet. See “Cycle Budget” on page 104.

6.4.3 Drop: Classifier Program

The classifier can decide to drop packets based on the contents of the incoming packet. The DEST field in the packet descriptor allows the classifier to drop both the packet data and the descriptor, or to drop just the packet data. The latter is useful for cases where the packet is already in memory and the classifier is being used on an eDMA loopback path, or when the packet data has been completely analyzed by the classifier and there is no need for application software to examine the data.

6.4.4 Drop: NotifRing Full

If workers are not draining their NotifRings fast enough, the ring will fill up. If this happens, the load balancer will drop both the descriptor and the packet data, and then increment the INGRESS_DROP_COUNT register.
6.4.5 Drop: Bucket Count Full

If workers are releasing the NotifRing without releasing the bucket in the Load Balancer Bucket Status Data register (MPIPE_LBL_INIT_DAT_BSTS_TBL), it is possible for the bucket counter to reach its maximum value (64K) without the NotifRing becoming full. If the bucket is configured in DFA mode, descriptors arriving at the load balancer for the associated bucket will be dropped. For buckets configured in FIXED, ALWAYS_PICK, or STICKY mode, the bucket count is allowed to wrap and packets will not be dropped when the counter reaches its maximum value.

6.4.6 Drop/Truncate: Out of Buffers

If the stack assigned to a packet has no buffers, the iDMA engine will drop the packet. If the packet is chained across multiple buffers, it will be truncated at the point the iDMA engine ran out of buffers. The BE (buffer error) bit will be asserted in the packet descriptor, and the buffer chain will be terminated by a buffer descriptor with a CHAIN_INVALID indicator. The packet descriptor’s TR (truncate) bit will not be set for this case. The packet descriptor’s L2_Size field will indicate how many bytes were written to Tile memory, not the total packet size. For packets that have cut-through, there will NOT be an extra buffer descriptor as is usually present. See “Buffer Release” on page 98.

6.5 Egress Services

Packets being sent from Tile memory space to the I/O device use the egress flow of the mPIPE. Similar to the ingress flow, the egress portion of the mPIPE is channelized. Each egress channel has its own eDMA descriptor ring and is non-blocking between channels.
6.5.1 Typical Egress Flow

[Figure 6-16: Typical Egress Flow. Blocks shown: SW post message with ring and index; descriptor manager (descBuf, read desc) with rings 0|1|2|...|N; eDMA picker; eDMA (read packet); ePkt; buffer release to Tile or buffer stack engine; egress channel picker with channels 0|1|2|...|N; MAC distribution. The numbers 1 to 5 in the figure correspond to the steps below.]

Egress packets typically use the following flow:

1. Software writes an egress descriptor in a ring in memory space and optionally sends an “egress post” MMIO write to the mPIPE.
2. The egress descriptor manager reads the egress descriptors from the ring stored in memory space. The egress descriptor describes the eDMA transaction to be performed.
3. The packet data is read from memory space and written into the ePkt buffer.
4. The data buffer is released back to the buffer stack manager or to software.
5. Once packets are ready for egress, a picker sends channelized packets to the MAC(s).

6.5.2 eDMA Packet Descriptors

Packets destined for egress are defined using one or more eDMA descriptors. These descriptors are stored in rings in Tile memory space.

Note: MiCA also uses the eDMA Descriptor format (refer to Section 11.2.1.3 Overview and Major Functional Blocks).

The mPIPE provides multiple rings to support nonblocking egress flows and differentiated classes of service on the same egress channel. The rings can be set to one of four sizes: 512, 2K, 8K, or 65536 descriptors (each descriptor is 16 bytes).

Implementation Note: The TILE-Gx36 device has 24 egress descriptor rings.

Each ring is associated with a particular channel and multiple rings can be assigned to the same channel.

The eDMA descriptor format is shown in Figure 6-17 and described in detail in Table 6-11 (note that the last 8 bytes comprise a buffer descriptor in the same format that iDMA uses).

[Figure 6-17: eDMA Descriptor Format. Four 32-bit words at offsets 0x00, 0x04, 0x08, and 0x0C. Fields shown: Gen, CSUM, NS, Notif, Bound, CSUM_SEED, CSUM_DEST, CSUM_START, and Size, followed by a buffer descriptor (VA[41:32], VA[31:7], Offset, StackIDX, Size, HWB, C).]

Table 6-11. eDMA Packet Descriptors

Bits | Field | Description
29:16 | Size | Number of bytes to be sent for this descriptor. When 0, no data will be moved and the buffer descriptor will be ignored. If the buffer descriptor indicates that it is chained, the low 7 bits of the VA indicate the offset within the first buffer (that is, 127 bytes is the maximum offset into the first buffer). If the size exceeds a single buffer, subsequent buffer descriptors will be fetched prior to processing the next eDMA descriptor in the ring.
15:8 | CSUM_START | Start Byte of Checksum. The checksum start is relative to the first byte of this descriptor. If multiple descriptors for the same packet have CSUM enabled, behavior is unpredictable.
7:0 | CSUM_DEST | Destination of checksum relative to the first byte of this descriptor. The destination bytes fall within the current descriptor space. CSUM_DEST must be less than the value specified in the Size bitfield.
11 | Bound | Boundary Bit. This transfer includes the EOP for this command. Clear on all but the last descriptor for an egress packet.
10 | Notif | Notification interrupt will be delivered when the descriptor has been completely processed.
9 | NS | NoSend. Nothing to be sent (packet was dropped by software). All buffers will be processed and returned as appropriate.
8 | CSUM | Checksum Generation Enabled. Once enabled, subsequent descriptors must not set the CSUM bit. Checksum will be calculated on bytes from CSUM_START to the end of the packet.
0 | Gen | Generation Number. Used to indicate a valid descriptor ring.
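The bit positions in Table 6-11 can be exercised with a small packing sketch. This is illustrative Python, not a Tilera API. Because the flag bits (11:8 and 0) overlap the CSUM_START range, the sketch assumes the fields live in two different 32-bit words, a flag word and a size word, and it makes no claim about which word sits at offset 0x00 or 0x04. The CSUM_SEED field shown in Figure 6-17 and the trailing buffer descriptor are omitted, and the function name is hypothetical.

```python
def pack_edma_words(size, bound, gen, notif=False, ns=False,
                    csum=False, csum_start=0, csum_dest=0):
    """Return (flag_word, size_word) using the bit positions of Table 6-11.

    The split into two words is an assumption made for illustration.
    """
    if not 0 <= size < (1 << 14):
        raise ValueError("Size must fit in bits 29:16 (14 bits)")
    if not (0 <= csum_start < 256 and 0 <= csum_dest < 256):
        raise ValueError("CSUM_START and CSUM_DEST must fit in 8 bits")
    if csum and csum_dest >= size:
        raise ValueError("CSUM_DEST must be less than Size")
    flag_word = (
        (int(bool(bound)) << 11)    # Bound: last descriptor of the packet
        | (int(bool(notif)) << 10)  # Notif: interrupt when processed
        | (int(bool(ns)) << 9)      # NS: NoSend
        | (int(bool(csum)) << 8)    # CSUM: checksum generation enabled
        | (gen & 1)                 # Gen: generation number
    )
    size_word = (size << 16) | (csum_start << 8) | csum_dest
    return flag_word, size_word


flags, size_word = pack_edma_words(
    size=64, bound=True, gen=1, csum=True, csum_start=14, csum_dest=30
)
print(hex(flags))      # 0x901
print(hex(size_word))  # 0x400e1e
```

The range check on CSUM_DEST mirrors the constraint stated in the table; the other checks only guard the field widths.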
For more information about Buffer Descriptor formats, see Figure 6-3 on page 95.

6.5.2.1 eDMA Descriptor Fetch

The descriptor manager prefetches descriptors to optimize performance and bandwidth. In order to reduce the Tile memory system bandwidth dedicated to descriptor fetches, the descriptor manager tracks the state of a window of descriptors starting at the current head pointer. Each descriptor is in one of four states:

• UNKNOWN. The state of the descriptor is not known; hardware must fetch it.
• KNOWN_INVALID. The descriptor is not valid; hardware waits for a post or a hunt.
• KNOWN_VALID. The descriptor is valid; hardware must fetch it.
• DONE. The descriptor has already been fetched and is valid.

A software descriptor post updates state information within the window of descriptors ahead of the head pointer, which causes the state to go to KNOWN_VALID. The hardware fetches descriptors starting at the head pointer that are in the UNKNOWN or KNOWN_INVALID states. A fetched descriptor will update the state from UNKNOWN to KNOWN_INVALID or KNOWN_VALID.

Implementation Note: The TILE-Gx36 device monitors incoming posts in a window of 64 descriptor locations in advance of the head pointer in order to prevent excess reads of invalid descriptors. Posts beyond the window being monitored by the hardware will not have any effect, but the descriptor engine will find the descriptor since it will fetch any descriptors in the UNKNOWN state.

6.5.2.2 eDMA Descriptor Hunt Mode

Each ring has an optional hunt mode configured by the MPIPE_EDMA_DM_INIT_DAT register. This mode allows the descriptor engine to search for valid descriptors even if the state of the descriptor is KNOWN_INVALID.

Rings that are not in hunt mode will not fetch any descriptors if the state of the head pointer is KNOWN_INVALID. Thus software must issue a post, as described in “Explicit eDMA Descriptor Post” on page 126, to cause the descriptor(s) to be processed.
Rings that have been configured with hunt mode enabled do not require any software posting. Instead, the descriptor manager will periodically check for valid descriptors on the ring(s). The HUNT_CYCLES setting of the MPIPE_EDMA_CTL register controls how often the descriptor fetcher will read a ring that is in the KNOWN_INVALID state. Once descriptors are found on a ring, the descriptor manager will continue to fetch descriptors aggressively until it finds an invalid descriptor. Thus high-performance applications with bursty behavior can operate without any posts at all.

In order to increase the responsiveness of a ring in hunt mode, software can optionally send a descriptor post with the count field set to 0. This will “wake up” the ring and cause it to look for descriptors without waiting for the hunt-mode timer to expire. This reduces the latency and jitter on egress descriptor processing but is not otherwise required for high-performance egress operation.

Figure 6-18 shows the behavior for a ring operating in hunt mode.

[Figure 6-18: eDMA Descriptor Ring Behavior in Hunt Mode. Diagram labels: Fast Mode; Req MaxReq; Desc Post-0; Hunt Mode (Sleep); Timer Expired, Win Ring Arb; Win Ring Arb; Response: All Desc Valid; Response: Some Desc Not Valid; Send Request(s); Wait for Response]

Note that the descriptor engine will prefetch descriptors beyond the head pointer regardless of hunt mode. Hence the ordering of packet data writes and descriptor writes must be maintained as described in section “Descriptor Prefetch and Memory Ordering” on page 127.

6.5.2.3 Explicit eDMA Descriptor Post

Software can optionally inform the eDMA descriptor manager that a descriptor or set of descriptors is valid by issuing an MMIO store to the associated eDMA ring.
The store data contains the current tail pointer and a count of how many descriptors are valid. Although the hardware is designed to maintain line rate with single-descriptor posts, software can reduce the required descriptor-read bandwidth by batching two or more descriptors per posting.

Explicitly posting descriptors improves the response time of the descriptor fetch engine and provides more predictable ring interleaving when multiple rings are active on the same egress channel. Egress rings that are not in hunt mode are required to use explicit descriptor posts.

6.5.2.4 eDMA Descriptor Ring Reordering

In order to allow worker Tiles to write eDMA descriptors to the ring in any order, the eDMA descriptor engine supports a valid indicator in each descriptor. Workers can be assigned slots in the ring and can post their eDMA descriptor at any time without regard for other workers that are sharing the ring. The descriptor engine will process the descriptors as they become valid from oldest to newest (head to tail). Thus the tail pointers in MMIO posts sent to the descriptor manager are NOT required to be in ascending order.

To prevent the need for clearing the valid bit each time a descriptor is processed, the valid indicator is implemented using a generation number. The generation number is incremented each time the ring wraps; hence the hardware can always tell when a descriptor is valid by comparing the current generation number to the generation number stored in the descriptor.

Each time a descriptor is written into the ring, software optionally sends a message to the descriptor manager with the location that was written (these can also be batched by software, as described below). If the location posted matches the current head pointer, the descriptor manager will begin fetching descriptors for the eDMA engine.
It will continue fetching until it encounters an invalid descriptor, at which point it waits for a post to the oldest location in the ring.

The eDMA hardware also supports batch posting of descriptors. The message sent to the descriptor manager contains a count of the number of descriptors that have become valid. If a batch of descriptors crosses the ring boundary, the descriptor manager will automatically wrap to the beginning of the ring. Software is encouraged to batch descriptors if possible as a way to reduce the overall Tile-memory bandwidth demand.

6.5.2.5 Descriptor Prefetch and Memory Ordering

In order to maintain line-rate performance, the descriptor manager prefetches descriptors from the ring based on various heuristics. For example, the descriptor manager can prefetch descriptors to the next cacheline boundary or even additional cachelines under certain circumstances.

Due to this hardware prefetching, software must never set the valid bit (generation number) on a descriptor unless that descriptor is recognized as being valid. This also means that software must write all other bytes of the descriptor prior to writing the generation number field of the descriptor.

The TILE-Gx processor memory system guarantees that writes to the same cacheline, and hence to the same eDMA descriptor, will be observed by the descriptor manager in order. No fence is needed between the writes to the descriptor data and the descriptor’s generation number as long as the second 8 bytes of the descriptor are written in program order prior to the first 8 bytes.

6.5.2.6 Descriptor-Write and Descriptor-Post Ordering

When software posts descriptors, it will typically write the descriptor data, issue a memory fence, then send the post MMIO write to the eDMA engine.

The eDMA descriptor manager supports an optional hunt mode that allows software to operate without the memory fence described above.
In this mode, software can write the descriptor data then issue the post without an intervening memory fence. This might cause the descriptor manager to fetch descriptors that are not yet valid. In hunt mode, the descriptor engine will continue to refetch the descriptor until it is valid. This mode trades software performance for potential extra read/response bandwidth in the case where the descriptor data has not yet become visible.

The other potential risk with hunt mode is that an inadvertent software post of an invalid descriptor will cause the descriptor engine to continue to refetch the invalid descriptor “forever”. However, this will be rate limited by the HUNT_CYCLES setting.

6.5.2.7 Ring to Channel Mapping

Each descriptor ring is associated with a single egress channel. Multiple rings can be assigned to the same channel in order to provide independent rings for different software applications or classes of service. Independent rings will not cause head-of-line blocking with each other; hence a low-priority flow and a high-priority flow on the same egress channel but on different rings will not interfere.

When packets are available from multiple rings targeting the same egress channel, round-robin arbitration is used to determine egress packet order.

The mapping of ring to channel is configured via the MPIPE_EDMA_DM_INIT_CTL/MPIPE_EDMA_DM_INIT_DAT registers.

6.5.2.8 Descriptor Errors

The eDMA engine detects descriptor errors and faults. When an error is detected, the associated ring is frozen and software must flush the DMA ring, as described in section “EDMA Ring Drain” on page 140. The eDMA descriptor errors that are detected are described in Table 6-12.

Table 6-12. eDMA Fault Handling

Error Type | Description | Handling
Illegal Stack | Descriptor attempted to return buffers to a stack whose STACK_ENA bit was clear in the MPIPE_EDMA_RG_INIT_DAT_STACK_PROT register. | Descriptor is discarded, ring is frozen, the EDMA_DESC_DISCARD interrupt is triggered.
Illegal ASID | ASID_ENA bit was clear in the MPIPE_EDMA_RG_INIT_DAT_STACK_PROT register for the stack that was specified in the buffer descriptor. | Same as Illegal Stack (discard/freeze/interrupt).
Size Error | Size of transfer exceeded the size of hardware-returned unchained buffer. If buffer is NOT returned to hardware or buffer is chained, this error will not be flagged. | Same as Illegal Stack (discard/freeze/interrupt).
TLB Fault | No translation found for ASID/VA. | If associated ASID’s MPIPE_EDMA_ASID_FAULT_MODE register is set, handling is identical to the errors above (discard/freeze/interrupt). If the FAULT_MODE bit is clear, the descriptor will be retried until a valid translation is installed.

6.5.3 Buffers

The egress portion of the mPIPE uses the same buffer descriptor format as the one used in the ingress mPIPE (refer to Figure 6-3 on page 95). This allows buffer descriptors to be automatically recycled back to the buffer stack engine on egress.

6.5.3.1 Chaining

To support line rate gather compatible with the iDMA chaining architecture, the eDMA engine supports linked-list-chained buffers where each buffer contains a buffer descriptor for the next buffer in the chain. The buffer descriptor is stored in the first 8 bytes of each buffer. The eDMA hardware determines when it has finished fetching buffers, based on the Size field stored in the eDMA packet descriptor and the size associated with each buffer descriptor it fetches.

6.5.3.2 Descriptor-Based Gather

When the system requires high performance from a single ring and the buffer size is relatively small (less than 512 bytes), the hardware linked-list chase might not provide sufficient bandwidth since too few Tile memory reads can be launched to keep the egress pipe busy.
In this case, software can instead provide the eDMA engine with a set of descriptors that form a gather list. This allows the hardware to prefetch data in order to keep up with the system line rate. If the data in Tile memory is stored in linked-list chain format, the eDMA descriptors must point to the data portion of each linked-list buffer (that is, skip over the first 8 bytes in the buffer). All but the last eDMA descriptor must have its Boundary bit cleared, indicating that subsequent descriptors make up the egress packet.

If software is using large buffers (512 bytes or larger) or many simultaneous channels, C=1 mode is sufficient (buffer pointers stored in the first 8 bytes of each buffer). The eDMA engine will also read ahead multiple descriptors so small packet performance is not impacted by the buffer chaining “gather” issue.

6.5.3.3 Transaction Sizing and Buffer Offsets

The interaction between the eDMA descriptor size field, buffer descriptor chain field, buffer descriptor size field, descriptor offset, and hardware release field warrants detailed explanation and rules.

• The eDMA size field dictates the number of bytes moved by the descriptor. This value does NOT include the 8-byte chain pointers that can be included in the first 8 bytes of each buffer.
• If the buffer descriptor in the eDMA descriptor indicates that the buffer is chained, ALL buffers associated with this eDMA descriptor contain a chain pointer in the first 8 bytes. The one in the final buffer can be marked as INVALID. Hardware will return the empty buffer associated with the final buffer pointer if the DISABLE_FINAL_BUF_RTN bit of the MPIPE_EDMA_DIAG_CTL register is zero and the buffer is not marked INVALID. This occurs when an ingress packet was cut through and chained.
• The VA provided by the buffer descriptor inside the eDMA descriptor provides the starting byte location of the packet data. When buffers are chained, the offset is inclusive of the 8-byte chain pointer field contained in the first 8 bytes of all chained buffers. When releasing buffers to hardware, the buffer stack manager does not store the low 7 VA bits.
• Since buffers are only required to be aligned to 128B boundaries, it is not possible for hardware to process chained buffers with an offset larger than 127. In other words, the location of the chain pointer is derived by clearing the low 7 bits of the buffer VA. If software wishes to send packet data starting more than 127 bytes into a buffer, it must release the buffer back to software and NOT use buffer chaining.
• For buffers with the HWB bit set (hardware release), a size error is detected if the buffer is unchained and exceeds the boundary of the buffer based on the encoded buffer size. When a size error is encountered, the descriptor will be discarded, an EDMA_DESC_DISCARD interrupt will be triggered, and the ring will be frozen.
• The above size error check is NOT performed on buffers with the HWB bit clear, because the check would not be possible in all cases: the buffer’s actual start address is not known (the offset could be larger than 127 bytes).
• Each buffer stack is associated with a single buffer size. If an eDMA buffer descriptor is returned to a hardware stack with a different size configuration, the buffer will be treated as having the size associated with the stack upon reuse for iDMA.

6.5.3.4 Buffer Release

Each buffer descriptor used for eDMA contains an HWB field, which indicates whether or not the buffer should be returned to the stack engine.

When releasing buffers to hardware, the SIZE field in the buffer descriptor must match the size configured for the associated buffer stack.
If these do not match, buffers might be lost from the associated stack or data within the associated ASID region might be corrupted.

If software wants to manage its own buffers, the HWB bit must be clear and software needs to determine that the eDMA transaction is complete using either the eDMA-ring interrupt or by reading the descriptor-complete count in the eDMA-ring MMIO structure.

6.5.3.5 Egress VA Translations

Egress packet descriptors contain a buffer descriptor with a virtual address. This VA is translated into a physical address by the eDMA descriptor processing hardware using the ASID/region process described in Section 6.6 Virtual Memory. The StackIDX provided in each buffer descriptor is used to determine which ASID (set of TLB entries) is used for the translation.

If the HWB bit is set in the buffer descriptor, the buffer will be returned to hardware and an additional protection check is provided via the Stack Protection bits in the MPIPE_EDMA_RG_INIT_CTL/MPIPE_EDMA_RG_INIT_DAT registers.

6.5.4 eDMA Engine

As valid descriptors are fetched by the descriptor manager, they are presented to the eDMA engine. Since multiple channels might have valid descriptors pending, a picker determines which descriptors to process based on ePKT (buffer) availability and round-robin fairness arbitration.

Once a descriptor has been chosen, the eDMA engine reads the data from Tile memory space and writes the data into the ePKT buffer queue associated with the descriptor’s channel. Buffers are then freed to the buffer stack manager or messaged to software.

In order to maintain line rate performance on any single flow, the eDMA engine performs descriptor processing (packet reads), response processing (write into ePkt buffer), and notification in parallel.
This allows many simultaneous packet gather threads to run in parallel to hide memory read latency, Tile resource contention, and temporary network congestion.

6.5.5 ePkt Buffering

The ePkt buffer provides elasticity between the Tile memory system and the egress MAC interface. This elasticity prevents variations in eDMA read latency from affecting the output bandwidth.

The ePkt buffer size is measured in 128-byte blocks. Packets always consume an integer number of blocks and a block cannot be shared between two different packets. The buffer size is provided in the MPIPE_EDMA_STS register.

The ePkt buffer blocks are divided into an undifferentiated pool and a reserved pool. Undifferentiated buffers can be consumed by any ring that has its DB bit set in the THRESH structure of the MPIPE_EDMA_RG_INIT_CTL/MPIPE_EDMA_RG_INIT_DAT register set.

The reserved pool is divided between the eDMA rings based on the MAX_BLKS thresholds, which are programmable on a per-ring basis via the MPIPE_EDMA_RG_INIT_CTL/MPIPE_EDMA_RG_INIT_DAT registers. Head-of-line blocking between rings will not occur as long as the sum of all of the MAX_BLKS thresholds does not exceed the size of the reserved buffer pool.

The size of the undifferentiated and reserved pools is configured via the UD_BLOCKS setting in the MPIPE_EDMA_CTL register.

6.5.6 Notifications

There are several types of eDMA-complete notifications to provide software with flow control and buffer-complete information.

6.5.6.1 Descriptor Ring Head

As descriptors are consumed by the eDMA descriptor fetch engine, the hardware updates the head pointer. The head pointer is accessed by an MMIO read to the “eDMA Ring” STRUCT. This mechanism allows software to know when the ring is full. No interrupt is raised when a descriptor has been read. It is assumed that software will either periodically read the head pointer or use the descriptor-complete interrupt/counter described below.
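As a rough illustration of how software might use the head pointer for flow control, the sketch below computes the number of free descriptor slots from the hardware head pointer and the software tail pointer. The function name is hypothetical, and the sketch assumes that both pointers are slot indices in the range 0 to ring_size - 1; the actual layout of the “eDMA Ring” structure is not reproduced here. Keeping one slot unused so that a full ring can be told apart from an empty one is a design choice of this sketch, not a documented hardware requirement.

```python
def free_slots(head, tail, ring_size):
    """Number of descriptors software may still write before the ring is full."""
    if not (0 <= head < ring_size and 0 <= tail < ring_size):
        raise ValueError("head and tail must be slot indices inside the ring")
    in_use = (tail - head) % ring_size   # slots written but not yet fetched
    return ring_size - 1 - in_use        # keep one slot unused (full vs. empty)
```

A free slot here only means that the descriptor has been fetched from the ring. It does not mean that the buffers referenced by that descriptor can be reused.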
6.5.6.2 Descriptor Complete Interrupt and Counter

The descriptor ring head pointer provides an indication of what descriptors have been fetched from the ring, but it does not indicate that the transaction associated with the descriptor has completed. When software needs to know that the transactions associated with an eDMA descriptor have completed, the DescComplete interrupt and DescriptorCompleteCount are used. These indicate that all memory associated with the descriptor has been accessed and the buffer can be reused.

Each ring contains a 16-bit rolling count of descriptors that have been processed. This counter can be masked based on the ring size to determine which descriptors have been completely processed. The DescriptorCompleteCount is directly accessible via an MMIO read to the “eDMA Ring” structure.

The DescriptorComplete mechanism can be used when buffers are being managed by software rather than being returned directly to the buffer stack manager.

6.5.7 Checksum

The egress mPIPE supports checksum offload typically used for the TCP body. The eDMA packet descriptor contains a checksum seed, start octet, total octet count, and destination start octet. As the eDMA engine writes data into the ePKT buffer, the checksum is tracked and updated.

6.5.7.1 eDMA Checksum Buffer Limitations

Since the checksum result is typically stored in the header, the entire packet must be read before the packet can begin egress to the MAC. Thus buffering must be provided for packets undergoing checksum. Multi-channel implementations might provide insufficient buffering for checksum of jumbo frames. Attempting to checksum a packet larger than the cut-through size will result in a corrupted checksum.

Implementation Note: The TILE-Gx36 device provides a 195KB ePKT buffer dynamically partitioned between all active egress rings.
A four-ring implementation can provide checksum offload for all frame sizes. However, since line-rate performance requires double-buffering, a 24-ring configuration can only provide hardware checksum on frames of up to 4KB (4KB*2*24 = 192KB, which fits within the 195KB buffer).

Descriptors with checksum enabled must follow these rules:

• The CSUM_START field specifies the first byte to include in the checksum. All bytes from CSUM_START to the end of the packet will be included.
• The CSUM_DEST field specifies the target of the first (more significant) byte of the checksum result.
• At most one eDMA descriptor per packet can have its CSUM_ENA bit set.
• For descriptors with CSUM_ENA=0, CSUM_START, CSUM_DEST, and CSUM_SEED must be zero.
• CSUM_START or CSUM_DEST can specify a byte beyond the current eDMA descriptor. The checksum engine will wait until CSUM_START bytes have been collected across as many descriptors as necessary before beginning the checksum.
• If the total packet size is larger than the cut-through threshold, only the bytes up to the threshold will be included in the checksum operation.
• The CSUM_SEED is formatted in network byte order. In other words, bits[7:0] of CSUM_SEED will be added to the byte specified by CSUM_START. The resulting checksum will be inverted and placed into CSUM_DEST with bits[7:0] going into the byte pointed to by CSUM_DEST and bits[15:8] going into the byte pointed to by CSUM_DEST+1.
• The CSUM_DEST and CSUM_START fields do not include bytes that are part of a NoSend=1 descriptor. In other words, if CSUM_DEST is set to 100 with NoSend=1 (and boundary=0), the CSUM_DEST will be 100 bytes into the next descriptor that has NoSend=0.

6.5.8 Egress Picker

Each egress channel can generate back pressure at any time, even within a packet.
To maintain line rate service on other channels, an egress picker matches available channels from the ePkt buffer to non-blocked egress channels. This provides low latency response to flow control events (for example, pause frames, priority pause frames, and channel flow control) while keeping the link(s) saturated.

6.5.8.1 Egress Priority Arbitration

Each eDMA ring has a programmable priority level in the PRIORITY_LVL bit of the MPIPE_EDMA_RG_INIT_DAT_MAP register. This field selects one of three priority levels for the ring. The egress arbiter chooses round-robin within each priority level and maintains strict priority between the levels. Thus rings set to level-2 will always have priority over rings set to levels 0 and 1.

The mPIPE egress picker also provides bandwidth controls for each priority level. This can be used for basic rate shaping and prevention of starvation between the strict priority levels. The bandwidth controls are located in the MPIPE_EDMA_BW_CTL register.

A token bucket scheme is used wherein each of the three egress priority levels is provided with a token bucket. A token for the ring’s priority level must be available in order for a packet to begin sending. Each time a 128-byte block is sent, a token is consumed. The token buckets for each priority level are replenished based on the settings in this register.

These register settings control the rate at which tokens are replenished for each priority level. Each unit represents approximately 6*LINE_RATE/(N+1) where LINE_RATE includes packet overhead. When set to 0, tokens are replenished as fast as they can be consumed; hence setting all PRIORn_RATE values to zero will revert to a strict-priority scheme.

Note that the packet arbiter can run faster than line rate since there is buffering in the egress path. For this reason, settings that allow the bandwidth to exceed LINE_RATE are meaningful.

The setting for PRIOR2 is typically higher than PRIOR1 and PRIOR0 in order to prevent starvation.
Similarly, PRIOR1 is typically set higher than PRIOR0.

The LINE_RATE divider setting in the MPIPE_EDMA_BW_CTL register can be used to set the coarse granularity for the expected egress bandwidth, and the PRIORn_RATE settings fine-tune the bandwidth. Additionally, the BURST_LENGTH bit determines the hysteresis of the token buckets.

6.5.8.2 Egress Priority Flow Control

Each eDMA ring has a programmable mask in the PRIORITY_QUEUES bit of the MPIPE_EDMA_RG_INIT_DAT_MAP register indicating which priority queue(s) will be sent using the ring. When a MAC applies back pressure to a particular priority queue using the priority-pause mechanism, any rings with the associated mask bit set will be back-pressured.

Software must ensure that packets targeting a given priority queue are sent using a ring that has the associated PRIORITY_QUEUES bit set.

6.5.9 Special Flows

The eDMA engine provides special flows to support packet drops and loopback.

6.5.9.1 NoSend Option

If the system requires packets to be dropped but still wants to return buffers to the buffer stack manager, the NoSend mode can be used. When the NS bit is set in the eDMA descriptor, the eDMA command will be processed as normal, except no data will be transferred to the MAC. All buffers will be returned as specified in the eDMA and buffer descriptors, and the notification and HWB (buffer return) fields in each buffer descriptor are still honored.

All checksum fields in NoSend descriptors must be zero.

The boundary bit is ignored and treated as if zero on NoSend descriptors. Thus the last descriptor for a packet must move at least one byte; a NoSend descriptor cannot be used to terminate a “real” packet.
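The NoSend rules above lend themselves to a simple software-side check before a descriptor is written to the ring. The sketch below is illustrative only: the `EdmaDesc` class is a hypothetical software view of a few descriptor fields, not the hardware bit layout, and the function names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class EdmaDesc:
    """Software-side view of a few eDMA descriptor fields (hypothetical)."""
    size: int = 0
    bound: bool = False
    ns: bool = False
    csum: bool = False
    csum_start: int = 0
    csum_dest: int = 0
    csum_seed: int = 0


def effective_boundary(d):
    """The Boundary bit is ignored (treated as zero) on NoSend descriptors."""
    return d.bound and not d.ns


def check_desc(d):
    """Return a list of rule violations for one descriptor."""
    problems = []
    if d.ns and (d.csum or d.csum_start or d.csum_dest or d.csum_seed):
        problems.append("checksum fields must be zero on a NoSend descriptor")
    if effective_boundary(d) and d.size < 1:
        problems.append("the last descriptor of a packet must move at least one byte")
    return problems
```

A caller could run `check_desc` on each descriptor in debug builds and refuse to post any descriptor for which the returned list is not empty.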
If application software has reserved a slot in an eDMA ring but does not want to send any packet and does not wish to return any buffers, it can post a descriptor with NoSend=1 and Size=0.

Note that descriptors with NoSend=1 and a nonzero Size field must always follow a descriptor with Boundary=1 or another NoSend=1 descriptor. In other words, a NoSend=1 descriptor with Size!=0 cannot be used in the middle of a “normal” set of descriptors.

6.5.9.2 Size=0 Option

Descriptors with size=0 are No-Ops. These are typically used to fill a slot in the ring that is not going to be used and needs to be skipped over. Size=0 descriptors should have all CSUM-related fields set to 0.

6.5.9.3 eDMA Loopback

The ingress flow has channels dedicated to receiving data looped back from eDMA. This allows the high-performance eDMA gather portion of the mPIPE to feed packet data into the iDMA classification, load-balance, and distribution flows. This option is used, for example, to treat data moved from a PCI interface as packet data and distribute it to worker Tiles as if it had arrived on a packet interface.

Bandwidth that is used for eDMA loopback is unavailable for normal egress traffic. So, for example, if a 40Gbps implementation is using 20Gbps of loopback bandwidth, then the egress channels will only support 20Gbps. Typically this restriction is manifested as a system requirement based on the amount of ingress PDA offload supported (in this example, 40Gbps).

The PRIORITY_QUEUES bit of the MPIPE_EDMA_RG_INIT_DAT_MAP register determines which priority queue is consumed by packets on an eDMA loopback channel. The highest set bit in the PRIORITY_QUEUES mask is decoded as the priority queue.

Implementation Note: The TILE-Gx36 device provides four dedicated loopback channels.

6.6 Virtual Memory

The mPIPE supports virtual addressing on all structures with which software interacts. Each buffer stack is associated with an address space identifier (ASID).
Each ASID provides a set of VA-to-PA translations through the I/O TLB.

Implementation Note: The TILE-Gx36 device supports 32 ASIDs and 16 TLB entries per ASID. Each buffer stack is statically assigned to an ASID.

Buffer descriptors are associated with a single buffer stack and hence a specific virtual address space. As buffer descriptors are processed by the mPIPE, the VA is converted to a PA by searching for a valid translation within the set of TLB entries associated with the buffer stack’s ASID.

The buffer stacks themselves exist in physical memory with a set of attributes set up by configuration software. The application software’s only interaction with the stack is through the buffer stack manager, so no VA-to-PA translation is necessary; the PA space of the stack is essentially private to the stack engine.

Software provides buffer descriptors as part of the eDMA packet descriptor. In order to prevent software-generated buffer descriptors from accessing unauthorized buffer stacks and the associated ASID, each eDMA descriptor ring provides a protection mask, which indicates which buffer stack(s) the eDMA ring is allowed to access.

Similarly, software configures each notification ring and eDMA descriptor ring with a set of physical memory attributes including start-PA, homing information, and caching hints. The ring must reside in contiguous physical memory. eDMA rings are configured using the MPIPE_EDMA_DM_INIT_CTL/MPIPE_EDMA_DM_INIT_DAT registers. The ring protection mask along with the eDMA ASIDs are configured using the MPIPE_EDMA_DM_INIT_CTL/MPIPE_EDMA_DM_INIT_DAT registers.

It is up to software to map the rings into virtual address space(s) for the worker Tiles.
Communication between the worker and the mPIPE related to notification and eDMA rings uses head/tail pointers relative to the start of the ring, so no translation is necessary.

6.6.1 I/O TLB Details

System software is responsible for TLB fault handling. The following guidelines and properties apply to mPIPE’s TLB structure:

• eDMA and iDMA share a common I/O TLB. Each has a micro-TLB, which is not software visible.
• The micro-TLBs can be flushed by writing a 1 to the MTLB_FLUSH bit of the MPIPE_TLB_CTL register. This is used for shooting down TLB entries or ensuring that the micro-TLB is a subset of the main TLB on replacement.
• eDMA and iDMA each have their own interrupt binding (one binding each). The interrupts are both in vector-0: IDMA_TLB_MISS and EDMA_TLB_MISS.
• On a miss, the relevant information is placed into the MPIPE_TLB_IDMA_EXC and MPIPE_TLB_EDMA_EXC registers.
• For iDMA, the DMA engine will stall on a fault unless the associated ASID is in flush-on-fault mode as configured in the MPIPE_IDMA_ASID_FAULT_MODE register.
• If ASIDs are configured for iDMA flush-on-fault, it is possible for multiple misses to be reported. In this case, only the most recent fault will be captured in the MPIPE_TLB_IDMA_EXC register.
• For eDMA, the DMA descriptor that faults is retried. Other descriptor rings will continue to make progress.
• It is possible for multiple eDMA rings to report faults. Only the most recent fault will be recorded in the EDMA_TLB_MISS interrupt.
• System software must handle iDMA faults immediately if flush-on-fault mode is not being used. Otherwise, packets will be lost as soon as the iPkt buffer overflows. Fault handling time must be on the order of 1-5 microseconds.
• eDMA faults will result in a minor loss of performance in the eDMA engine, as the faulting rings will retry until the fault is handled.
In most cases, this loss will not be perceivable unless many rings are faulting while one ring is trying to achieve significant small-packet performance.

6.7 PA Distribution

The mPIPE uses the same algorithm as the Tile to assign a cacheline to a home Tile based on the physical address. Each translation region, stack, and ring specifies whether its associated physical memory is homed on a specific Tile or hashed-for-home (HFH) across a set of Tiles.

Every mPIPE function that accesses Tile physical memory space contains an HFH table that translates a physical address into a home Tile. These tables are typically set up once at mPIPE initialization time. The tables are accessed via the MPIPE_HFH_INIT_CTL/MPIPE_HFH_INIT_DAT registers.

6.7.1 Locality Hints

Each TLB entry has a NonTemporal (NT) hint used to indicate the locality properties of data contained on the page. For packet data that is likely to be accessed within a relatively short period of time, the NT hint should be set to 0. This will cause packet data writes to be marked as MRU (most recently used) in the home Tile’s cache and hence be less likely to be displaced.

For packet data that is not likely to be accessed within a relatively short period of time, the NT hint bit should be set to 1. This will cause the packet data to be marked as LRU (least recently used), making it more likely to be displaced and thus reducing the cache footprint of streaming packet data.

Table 6-13 summarizes the caching characteristics of packet read/write data based on the locality hint and state of the cache.

Table 6-13. NonTemporal Hint Behavior

Operation  NonTemporal Hint  Cache State  Behavior
Write      0                 Miss         Block is allocated in cache and marked as MRU.
Write      0                 Hit          Block is updated in cache and marked as MRU.
Write      1                 Miss         Block is written to main memory and NOT allocated in the cache.
Write      1                 Hit          Block is written to the cache but the LRU state is not updated.
Read       0                 Miss         Data is fetched from main memory but not allocated in the cache.
Read       0                 Hit          Data is fetched from the cache but the LRU state is not updated.
Read       1                 Miss         Data is fetched from main memory but not allocated in the cache.
Read       1                 Hit          Data is fetched from the cache and the cacheline is marked clean. NOTE: this results in the contents of the cacheline becoming unpredictable to Tile software, so this must only be used if there are no Tile consumers of the packet data after egress.

For packet data writes, the NonTemporal hint comes from the associated TLB entry. However, a setting in the MPIPE_IDMA_CTL register can be used to override the hint used for the first N blocks of each packet.

For packet data reads, the NonTemporal hint can be configured on a per-ring basis to be sourced from the TLB entry or fixed. This setting is in the MPIPE_EDMA_RG_INIT_DAT register for each ring.

For the NotifRings and BufferStacks, the NonTemporal hint is programmable in the associated setup registers.

For descriptor reads, the NonTemporal hint is always 0.

6.7.2 Pinning

For I/O writes, an optional pinning attribute can be assigned on a per-TLB-entry or per-structure (NotifRing, buffer stack, etc.) basis. When asserted, the Tile’s I/O-pinned ways will be used for write data on a miss. This reduces the cache footprint of the associated I/O data for applications that require explicit cache control.

I/O reads do not use the pinning attribute since misses at the home Tile never install in the cache.

6.8 MMIO

Communication with the mPIPE uses MMIO. The physical address is broken into fields that map the address into the various mPIPE regions.

Figure 6-19: MMIO Physical Address (fields, from most significant: Reserved, SVC_DOM, Reserved, REGION, OFFSET)

The address fields are interpreted as described in Table 6-14:

Table 6-14. MMIO Physical Address Bit Descriptions

Bits   Name     Type  Reset  Description
39:35  SVC_DOM  RW    0      This field of the address indexes the 32-entry service domain table.
28:26  REGION   RW    0      This field of the address selects the region (address space) to be accessed.
                             Value  Name  Meaning
                             0      CFG   Access to Configuration space. Protection level is provided by the service domain vector.
                             4      IDMA  Access to iDMA NotifRing and Bucket release.
                             5      EDMA  Access to eDMA descriptor rings.
                             6      BSM   Access to the buffer stack manager.
25:0   OFFSET   RW    0      This field of the address provides an offset into the region being accessed.

6.8.1 MAC Configuration Registers

Access to each MAC’s configuration space is via the MMIO configuration region for mPIPE. The address for configuration space is described in Figure 6-20 and defined in Table 6-15.

Figure 6-20: Configuration Address Format

Table 6-15. Configuration Address Format Bit Descriptions

Bits   Name     Type  Reset  Description
21     INTFC    RW    0      Interface being accessed.
                             Value  Name   Meaning
                             0      mPIPE  Access to mPIPE registers
                             1      MAC    Access to MAC registers
20:16  MAC_SEL  RW    0      Selects the MAC being accessed when bit[21] is 1.
15:0   ADDR     RW    0      Register Address.

6.8.2 Service Domains

The mPIPE provides 32 independent service domains. Each service domain allows or disallows access to specific configuration domains, groups of buffer stacks, groups of NotifRings and buckets, and eDMA rings.

Each service domain has an entry in the service domain table. A table entry consists of bits associated with services within the mPIPE. When a bit is set, access is allowed. When clear, MMIO writes will be ignored and reads will return unpredictable data. When an access is rejected due to a service domain check, an error response is returned to the requesting Tile. This results in an asynchronous MMIO_ERROR interrupt at the requesting Tile.
Accesses that touch multiple service domains, such as the combined NotifRing/Bucket release, will be completely rejected if any individual check fails.

The service domain table (Table 6-16) is accessed via the MPIPE_MMIO_INIT_CTL/MPIPE_MMIO_INIT_DAT registers. By default, all services are enabled for all service domains.

Table 6-16. MMIO Service Domain Table Entry

Bits     Description
31:0     Allow access to NotifRings for releases. The NotifRings are divided into 32 regions based on the MSBs of the NotifRing Index. Each bit corresponds to one of these regions. For example, if bit[6] is set, then access is allowed to any NotifRing with an index whose 5 MSBs are 00110(b).
         Implementation Note: On the TILE-Gx36, there are 256 NotifRings, so each bit corresponds to the encoding of NotifRingIndex[7:3]. For example, NotifRings 0-7 are represented by bit[0], NotifRings 8-15 are represented by bit[1], and so on.
63:32    Allow access to Buckets for releases. The upper 16 bits are used to provide access to the upper “non power of 2” buckets. The lower 16 bits provide access to the lower “power of 2” buckets. See the register spec for details.
         Implementation Note: On the TILE-Gx36, there are 4160 Buckets, so each bit in the lower 16 bits corresponds to the encoding of BucketID[11:8] and is used when BucketID[12] is 0. When BucketID[12] is 1, the upper 16 bits of the vector are used and are indexed by BucketID[5:2].
95:64    Allow access to Buffer Stacks for buffer releases and fetches.
         Implementation Note: On the TILE-Gx36, there are 32 Buffer Stacks, so each bit is associated with a single stack.
119:96   Allow access to eDMA Rings for descriptor posts and head pointer reads.
121:120  Configuration protection level. An access to a service domain set to Level-2 can access registers at level 2 and below. Level-1 can only access level-1 and below. Level-0 can only access level-0 registers.

6.9 Interrupts

Interrupts are sent to Tile software using the IPI mechanism. An interrupt consists of a target TileID, InterruptNum, and EventNum. The TileID is the Tile receiving the interrupt, the InterruptNum selects one of the four IPI interrupt levels at the Tile, and the EventNum is the event number within the specified interrupt level.

The mPIPE provides a binding for each interrupt that it generates. A binding consists of the TileID, InterruptNum, and EventNum. The mPIPE interrupts are summarized in Table 6-17.

Table 6-17. mPIPE Interrupts

Int Name        Description                                                             Interrupt VecNum.BitNum
BSM_BAD_VA      Buffer stack manager received a buffer post with a bad VA               0.0
BSM_LIM_ERR     Attempt to post buffers beyond the size of the associated buffer stack  0.1
CLS_TINT        Classifier wrote the TileInt SPR                                        0.2
EDMA_POST_ERR   Post received but no valid descriptor                                   0.3
IDMA_CTR_OVFL   Counter Overflow                                                        1.[0-31]
EDMA_DESC_COMP  eDMA Descriptor complete                                                2.[0-23]

Each interrupt is assigned to a vector, which can be used to clear the interrupt status. Each vector provides access via a Read/Write-one-to-clear address and a ReadToClear address. The vector registers are the MPIPE_INT_VECx_W1TC and MPIPE_INT_VECx_RTC registers (for example, MPIPE_INT_VEC1_W1TC).

The interrupt bindings are held in the MPIPE_INT_BIND registers and contain the following fields:

• Enable Bit. When 0, the interrupt will not be sent. The status bit will still be updated.
• Mode Bit. When 1, the interrupt will be dispatched each time it occurs. When 0, the interrupt is only sent if the status bit is clear.
• TileID(8)
• IntNum(2)
• EventNum(6)

6.10 UserIO

Bulk data transfer to and from the mPIPE uses the Tile memory system and hence provides direct user access via virtual memory.
Communication with the mPIPE is through the MMIO interface (see Section 6.8 MMIO). System software controls access to the mPIPE via page table mappings. Hence process-level protection and user access are provided as part of the virtual memory system.

Interrupts are delivered via the IPI and are configured through bindings on each structure that generates interrupts, such as eDMA rings, buffer stacks, and various other exceptions.

6.11 Flush Mechanisms

When an application crashes or needs to be restarted, related resources in the mPIPE need to be flushed before they can be reallocated for another/new client. For example, there can be writes inflight to the application’s NotifRing(s) or descriptors pending processing for the application’s eDMA Ring(s). The mPIPE provides flush/drain mechanisms to aid system software in cleaning up a particular flow without disturbing the performance of other flows.

6.11.1 MMIO Access Drain

An application can have MMIO loads and stores outstanding to the various mPIPE structures including the buffer stack manager, load balancer, eDMA rings, or config space. An MF (memory fence) executed on the application’s Tile(s) will guarantee that all MMIO transactions have completed and any interrupts associated with those transactions will have been posted to the Tile.

6.11.2 NotifRing Drain

Before it can drain an application’s NotifRing(s), system software must first make sure no new packets will target the NotifRing. The procedure for doing this depends on the system configuration and can require updating the classifier program, the load balancer configuration, or both.

The NotifRing’s COUNT field of the MPIPE_LBL_INIT_DAT_NR_TBL_1 register can also be written to 0xfffe to force all incoming packets that target the NotifRing to be dropped. Setting the count must be done if the classifier program forces the NotifRing field in the descriptor, since the group and bucket settings have no effect in that case. This is useful if the NotifRing is going to be immediately restarted after being drained. Note that the load balancer must be temporarily frozen via the FREEZE bit of the MPIPE_LBL_CTL register while making configuration changes to the load balancer.

Once the NotifRing has been configured to no longer receive packets, software must poll the MPIPE_LBL_INIT_DAT_NR_INFL_CNT for the associated NotifRing until it reaches zero. This ensures that there are no data, descriptor, or tail-pointer writes inflight to the associated NotifRing.

Any remaining interrupts for the NotifRing can be cleared by reading the interrupt status vector from the Tile bound to the associated interrupt. An MF after this read will ensure that all interrupts for the NotifRing have been delivered to the Tile.

In summary, to drain a NotifRing software must:

1. Freeze the load balancer.
2. Remove the NotifRing from its group and buckets. This includes zeroing the bucket count on any DFA buckets and reassigning the bucket on any FIXED or STICKY mode buckets.
3. Optionally set the NotifRing’s COUNT to 0xfffe (this is done if packets are expected to still be getting assigned to the NotifRing’s group).
4. Unfreeze the load balancer.
5. If the classifier needs to be reprogrammed:
   a. Reprogram the classifier and poll the PGM_PND bit of the MPIPE_CLS_ENABLE register until clear.
   b. Set the CLS_FENCE bit in the MPIPE_CLS_CTL register and poll until clear to be sure all packets using the old program have been delivered to the load balancer.
   c. At this point, no new packets will be targeting the NotifRing.
6. Poll MPIPE_LBL_INIT_DAT_INFL_CNT until it reads zero.
7. Read the Interrupt Status register (INT_VEC*_RTC, for example the MPIPE_INT_VEC0_RTC register) from the bound Tile.
8. MF.
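The ordering of the drain steps above can be sketched as follows. This is a minimal illustration in Python, not driver code: the `dev` object and every method name called on it are hypothetical stand-ins for whatever register-access layer system software provides, and `RecordingDevice` exists only so that the sequence can be exercised without hardware.

```python
class RecordingDevice:
    """Hypothetical stand-in for a register-access layer; it only logs calls."""

    def __init__(self, inflight=2):
        self.log = []
        self.inflight = inflight  # simulated in-flight write count for the ring

    def __getattr__(self, name):
        # Any method not defined below (all names are illustrative) is logged
        # and returns None, so "pending" polls read as already clear.
        def call(*args):
            self.log.append(name)
        return call

    def ring_inflight_count(self, ring):
        self.log.append("ring_inflight_count")
        count = self.inflight
        self.inflight = max(0, self.inflight - 1)  # simulated writes landing
        return count


def drain_notif_ring(dev, ring, force_drop=False, reprogram_classifier=False):
    dev.set_lbl_freeze(True)                      # 1. freeze the load balancer
    dev.remove_ring_from_group_and_buckets(ring)  # 2. detach ring from group/buckets
    if force_drop:
        dev.set_ring_count(ring, 0xFFFE)          # 3. optional: force drops
    dev.set_lbl_freeze(False)                     # 4. unfreeze the load balancer
    if reprogram_classifier:                      # 5. only if the program changes
        dev.load_classifier_program()
        while dev.classifier_program_pending():   # 5a. poll PGM_PND until clear
            pass
        dev.set_classifier_fence()
        while dev.classifier_fence_pending():     # 5b. poll CLS_FENCE until clear
            pass
    while dev.ring_inflight_count(ring) != 0:     # 6. wait for in-flight writes
        pass
    dev.read_interrupt_status_rtc()               # 7. read-to-clear, on the bound Tile
    dev.memory_fence()                            # 8. MF
```

A production implementation would normally bound each polling loop with a timeout rather than spinning indefinitely.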
6.11.3 Ingress Channel Drain

If a link goes down, the MAC will automatically terminate any inflight packets with a MACERROR status. Software can ensure that all packets for the MAC and associated channel(s) have drained by polling for the per-channel iPKT counts to reach zero. The per-channel counters are accessed by setting the DIAG_CTR_SEL field in the MPIPE_IDMA_CTL register to CHANNEL and then reading the DIAG_CTR_VAL field of the MPIPE_IDMA_STS register.

6.11.4 EDMA Ring Drain

Before it can drain an eDMA ring, software first freezes the ring to prevent new descriptors from being fetched. It then must set the FLUSH bit for the ring to drain any data from the ePkt buffer. Once the buffer is completely flushed, the ring state must be re-initialized.

Flushing the ring can result in corrupted packets on the MAC interface since partially buffered or even partially sent packets might need to be truncated. The MAC will insert a bad CRC into flushed packets to ensure the packet is discarded at the receiving node.

The steps required are shown below:

1. Set the FLUSH and FREEZE bits in the ring’s MPIPE_EDMA_DM_INIT_DAT_SETUP register. This will prevent new descriptors from being fetched and prevent buffered descriptors from being processed.
2. Issue a memory fence.
3. Poll the FLUSH_PND bit in the MPIPE_EDMA_CTL register until it is clear. This clears out existing packet data for the ring and allows pending requests to complete.
4. Issue a memory fence.
5. Set the FENCE bit in the MPIPE_EDMA_CTL register.
6. Issue a memory fence.
7. Poll the FENCE bit in the MPIPE_EDMA_CTL register until it is clear. This clears any remaining requests.
8. Poll the FLUSH_PND bit of the MPIPE_EDMA_CTL register until clear. This clears remaining data from any drained requests. At this point, the ring has been flushed.
9. Get the descriptor complete count from the COUNT bits of the MPIPE_EDMA_POST_REGION_VAL register. This is the rolling count used for this DMA ring that indicates how many descriptors were processed (and if it had any associated buffers returned).
10. Set the ring’s head pointer to zero via the HEAD bits of the MPIPE_EDMA_DM_INIT_DAT_HEAD register.
11. Set MPIPE_EDMA_DM_INIT_DAT_DESC_STATE0 to 1 and MPIPE_EDMA_DM_INIT_DAT_DESC_STATE1 to 0 to return the descriptor fetch to its initial state.
12. Flush any outstanding interrupts by reading the associated interrupt vectors.
13. Issue a memory fence. At this point, the ring can be reused.

The procedure described above will flush a ring regardless of any backpressure being received from the MAC, corrupted descriptors, or descriptors with TLB misses.

To reduce or eliminate corrupted (truncated) packets on the MAC, software must first allow normally flowing packets and descriptors to complete. This is done by setting the ring’s FREEZE bit but not its FLUSH bit. Then the descriptor-complete count can be compared to the head pointer to see when all of the ring’s outstanding descriptors have been completed. Finally, the associated ring’s ePKT block counter can be read to see that it is empty. This provides a cleaner shutdown for rings that are otherwise behaving properly.

CHAPTER 7 XAUI MAC INTERFACE

7.1 Introduction

The TILE-Gx™ XAUI MAC and PCS provide a 10Gb/s Ethernet interface to mPIPE™. Configuration of the MAC is through the mPIPE MMIO space.
7.1.1 Features

• Compatible with IEEE Standard 802.3
• May be configured as four SGMII ports or a single XAUI port
• Optional double-rate XAUI mode for 20Gbps operation using four lanes at 6.25Gbps (see data sheet for specific device support)
• Custom modes for lower overhead and higher throughput
• Supports 802.1Qbb priority-based flow control
• High precision timestamping and IEEE 1588
• MDIO and Interrupt interfaces to off-chip PHYs
• Configurable CRC
• Multiple loopback modes and pattern generators for in-system test and characterization
• Independent polarity reversal on TX and RX
• Support for 802.3az Energy Efficient Ethernet

7.2 Register Spaces

Access to the XAUI MAC is via the mPIPE’s MAC interface in MMIO space. The format for the physical address is shown in Figure 7-1.

Figure 7-1: Physical Address Format (fields, from most significant: Reserved, SVC_DOM, Reserved, REGION, Reserved, INTFC, MAC_SEL, REG)

Table 7-1. Physical Address Format Bit Descriptions

Bits   Name     Type  Reset  Description
39:35  SVC_DOM  RW    0      This field of the address indexes the 32-entry service domain table.
28:26  REGION   RW    0      This field of the address selects the region (address space) to be accessed. For the config region, this field must be 0.
21     INTFC    RW    0      Interface being accessed.
                             Value  Name   Meaning
                             0      MPIPE  Access to MPIPE registers
                             1      MAC    Access to MAC registers
20:16  MAC_SEL  RW    0      Selects the MAC being accessed when bit[21] is 1.
15:0   REG      RW    0      Register address.

Registers in the XAUI MAC are all 8 bytes wide. Accesses smaller than 8 bytes are not supported and will result in an MMIO error returned to the requesting Tile.

7.3 MAC and Channel Mapping

MACs are assigned MAC-Numbers in hardware. This number is used in the MAC_SEL field of the MMIO address when accessing MAC registers. System software may perform MAC “discovery” by reading the MPIPE_XAUI_MAC_INFO register that is always at address 0x0000 in each MAC’s address space.

Each XAUI MAC is also assigned to a specific hardware channel in mPIPE. This mapping is provided in the CHANNEL field of the MPIPE_XAUI_MAC_INFO register and is also described by the mPIPE’s MPIPE_MACn_MAP registers. The channel numbers are assigned to eDMA rings by system software. See “Ring to Channel Mapping” on page 127.

7.4 Port Configuration

The XAUI port is enabled via the MPIPE_MAC_ENABLE register. Basic MAC settings are configured in the MPIPE_XAUI_TRANSMIT_CONTROL and MPIPE_XAUI_TRANSMIT_CONFIGURATION registers and the MPIPE_XAUI_RECEIVE_CONTROL and MPIPE_XAUI_RECEIVE_CONFIGURATION registers.

Once the port is enabled, the link status can be monitored through the MPIPE_XAUI_PCS_STS register.

A port that is disabled will automatically turn off the SERDES and operate in a low power mode. MAC registers, interrupts, and MDIO functions are still accessible when a port is disabled.

For XAUI ports that support double-rate mode (see TILE-Gx36 Data Sheet (DS400)), the DOUBLE_RATE bit of the MPIPE_XAUI_PCS_CTL register must be written prior to enabling the port via the MPIPE_MAC_ENABLE register.

7.4.1 Lane Sharing with SGMII

The XAUI port may be reconfigured into four independent SGMII ports. This is controlled by the MPIPE_MAC_ENABLE register. When one or more lanes are operating in SGMII mode, the XAUI MAC is no longer used and instead the SGMII MACs control the lane. See Chapter 8: SGMII MAC Interface.

The AVAIL bits of the MPIPE_XAUI_MAC_INFO and MPIPE_MAC_ENABLE registers indicate whether the port is able to be used in the device’s configuration.

7.5 Flow Control

The XAUI MAC supports standard 802.3 pause-based flow control as well as 802.1Qbb priority-based flow control. The MAC can auto-generate either pause type based on configurable high water marks in the mPIPE iPKT buffer.

The reception of pause frames triggers a pause condition on one or more TX queues. These queues can be mapped into the mPIPE eDMA arbiter to block traffic from one or more rings that target the MAC.

mPIPE supports up to 16 priority queues in the iPKT buffer. Each has a programmable high water mark. The MPIPE_PR_PAUSE_THR registers contain the programmable high water marks.

The lower 8 priority queues can be mapped directly to the eight 802.1Qbb PFC queues. The upper 8 priority queues are used for 802.3 pause and/or mPIPE loopback channels.

The PAUSE_MODE bit of the MPIPE_XAUI_MAC_INTFC_CTL register controls the type of pause frame to be sent when priority queues become full.

7.5.1 Priority-Based Flow Control

Incoming RX packets are assigned to a priority queue based on the data extracted from the VLAN priority tag as per IEEE 802.1Qbb. The RX queue selection can be overridden via the PRQ_OVD and PRQ_OVD_VAL bits of the MPIPE_XAUI_MAC_INTFC_CTL register.

As RX packets are written into their associated queues, a counter tracks the queue’s fullness. Once the high water mark is reached, the MAC can automatically dispatch priority pause frames indicating back pressure on one or more queues.

The MAC can be configured to monitor any number of queues via the TX_PRQ_ENA bit of the MPIPE_XAUI_MAC_INTFC_CTL register.

The MPIPE_XAUI_MAC_INTFC_TX_CTL register contains settings to override the mPIPE eDMA back pressure so that one or more queues can be manually paused or unpaused.

7.6 Interrupts

Each XAUI port has a dedicated interrupt input pin. This interrupt pin is intended for connection to an external PHY. When this interrupt input is high, an interrupt is signaled to the PHY_INT bit of the MPIPE_XAUI_INTERRUPT_STATUS register. This interrupt should typically be operated in mode-0 since the interrupt is level-sensitive.
In addition to the external (PHY) interrupt, the XAUI MAC produces a number of interrupt conditions as described in the MPIPE_XAUI_INTERRUPT_STATUS register description.

7.7 Timestamping and IEEE 1588

The XAUI MAC supports IEEE 1588 frame recognition and generation for system-wide time correlation.

IEEE 1588 is a standard for precision time synchronization in local area networks. It works with the exchange of special Precision Time Protocol (PTP) frames. The PTP messages can be transported over IEEE 802.3/Ethernet, over Internet Protocol Version 4, or over Internet Protocol Version 6 as described in the annex of IEEE P1588.D2.1.

Most 1588 functionality can be implemented in software, but for greatest accuracy hardware assist is required to detect when PTP event messages pass the GMII interface (clock time-stamp point). The MAC detects when the PTP event messages sync, delay_req, pdelay_req, and pdelay_resp are transmitted and received. The MPIPE_XAUI_TX_1588/MPIPE_XAUI_RX_1588 registers indicate the message time-stamp point of PTP event frames. These timestamp registers can be correlated back to the MPIPE_TIMESTAMP_VAL if desired.

Synchronization between master and slave clocks is a two-stage process.

First, the offset between the master and slave clocks is corrected by the master sending a sync frame to the slave with a follow-up frame containing the exact time the sync frame was sent. Hardware assist modules at the master and slave side detect exactly when the sync frame was sent by the master and received by the slave. The slave then corrects its clock to match the master clock.

Second, the transmission delay between the master and slave is corrected. The slave sends a delay request frame to the master, which sends a delay response frame in reply. Hardware assist modules at the master and slave side detect exactly when the delay request frame was sent by the slave and received by the master. The slave will now have enough information to adjust its clock to account for delay. For example, if the slave was assuming zero delay, the actual delay will be half the difference between the transmit and receive time of the delay request frame (assuming equal transmit and receive times), because the slave clock will be lagging the master clock by the delay time already.

For hardware assist it is necessary to time-stamp when sync and delay_req messages are sent and received. The time-stamp is taken when the message time-stamp point passes the clock time-stamp point. For Ethernet, the message time-stamp point is the SFD and the clock time-stamp point is the MII interface. (The 1588 spec refers to sync and delay_req messages as event messages, as these require time-stamping. Follow-up, delay response, and management messages do not require time-stamping and are referred to as general messages.)

1588 version 2 defines two additional PTP event messages. These are the peer delay request (Pdelay_Req) and peer delay response (Pdelay_Resp) messages. These messages are used to calculate the delay on a link. Nodes at both ends of a link send both types of frames (regardless of whether they contain a master or slave clock). The Pdelay_Resp message contains the time at which a Pdelay_Req was received and is itself an event message. The time at which a Pdelay_Resp message is transmitted is returned in a Pdelay_Resp_Follow_Up message.

1588 version 2 introduces transparent clocks, of which there are two kinds: peer-to-peer (P2P) and end-to-end (E2E). Transparent clocks measure the transit time of event messages through a bridge and amend a correction field within the message to allow for the transit time.
P2P transparent clocks additionally correct for the delay in the receive path of the link using the information gathered from the peer delay frames. With P2P transparent clocks, delay_req messages are not used to measure link delay. This simplifies the protocol and makes larger systems more stable.

The sof_tx and sof_rx signals are provided to indicate the message time-stamp point, and follow-up signals are provided to indicate the presence of an event frame. With 1588 version 1, for a given data rate the assertion of the event frame signals will be a fixed delay after the sof signals, so taking the time-stamp could be delayed until the event signals are asserted and suitable compensation made.

The XGM recognizes seven different encapsulations for PTP event messages:

1. 1588 version 1 (UDP/IPv4 multicast)
2. 1588 version 2 (UDP/IPv4 multicast)
3. 1588 version 2 (UDP/IPv6 multicast)
4. 1588 version 2 (Ethernet multicast)
5. 1588 version 1 (UDP/IPv4/VLAN multicast)
6. 1588 version 2 (UDP/IPv4/VLAN multicast)
7. 1588 version 2 (UDP/IPv6/VLAN multicast)

Example of a sync frame in the 1588 version 1 (UDP/IPv4) format:

Preamble/SFD                      55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13)               0800
IP stuff (Octets 14-22)
UDP (Octet 23)                    11
IP stuff (Octets 24-29)
IP DA (Octets 30-32)              E00001
IP DA (Octet 33)                  81 or 82 or 83 or 84
source IP port (Octets 34-35)
dest IP port (Octets 36-37)       013F
other stuff (Octets 38-42)
versionPTP (Octet 43)             01
other stuff (Octets 44-73)
control (Octet 74)                00
other stuff (Octets 75-168)

Example of a delay request frame in the 1588 version 1 (UDP/IPv4) format:

Preamble/SFD                      55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13)               0800
IP stuff (Octets 14-22)
UDP (Octet 23)                    11
IP stuff (Octets 24-29)
IP DA (Octets 30-32)              E00001
IP DA (Octet 33)                  81 or 82 or 83 or 84
source IP port (Octets 34-35)
dest IP port (Octets 36-37)       013F
other stuff (Octets 38-42)
versionPTP (Octet 43)             01
other stuff (Octets 44-73)
control (Octet 74)                01
other stuff (Octets 75-168)

For 1588 version 1 messages, sync and delay request frames are indicated by the XGM if the frame’s type field indicates IP (0x0800), the UDP protocol is indicated, the destination IP address is 224.0.1.129, 224.0.1.130, 224.0.1.131, or 224.0.1.132, the destination UDP port is 319, and the control field is correct. The control field is 0x00 for sync frames and 0x01 for delay request frames.

For 1588 version 2 messages, the type of frame is determined by looking at the message type field in the first byte of the PTP frame. Whether a frame is version 1 or version 2 can be determined by looking at the version PTP field in the second byte of both version 1 and version 2 PTP frames.

In version 2 messages, sync frames have a message type value of 0x0, delay_req have 0x1, pdelay_req have 0x2, and pdelay_resp have 0x3.
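The recognition rules for the untagged UDP/IPv4 encapsulation can be illustrated with a short software model. This is a sketch in Python of the checks described above, not a description of the hardware implementation. Function and constant names are hypothetical. Octet 0 is taken to be the first byte of the destination MAC address (the preamble and SFD are excluded). The set of version 2 destination addresses is taken from the frame examples in this section, and masking the message type and version 2 fields to their low four bits is an assumption based on the IEEE 1588 version 2 header layout.

```python
PTP_EVENT_UDP_PORT = 0x013F  # destination UDP port 319
V1_GROUPS = {bytes([224, 0, 1, n]) for n in (129, 130, 131, 132)}
V2_GROUPS = {bytes([224, 0, 1, 129]), bytes([224, 0, 0, 107])}  # E0000181, E000006B
V1_CONTROL = {0x00: "sync", 0x01: "delay_req"}
V2_MESSAGE_TYPE = {0x0: "sync", 0x1: "delay_req", 0x2: "pdelay_req", 0x3: "pdelay_resp"}


def classify_ptp_udp_ipv4(frame):
    """Return (version, event_name) for a PTP event frame, else None."""
    if len(frame) < 44:
        return None
    # Type field must be IPv4 and the IP protocol field must be UDP.
    if frame[12:14] != b"\x08\x00" or frame[23] != 0x11:
        return None
    if int.from_bytes(frame[36:38], "big") != PTP_EVENT_UDP_PORT:
        return None
    ip_da = bytes(frame[30:34])
    if frame[43] == 1 and ip_da in V1_GROUPS and len(frame) > 74:
        name = V1_CONTROL.get(frame[74])  # version 1: control field
        return (1, name) if name else None
    if (frame[43] & 0x0F) == 2 and ip_da in V2_GROUPS:
        name = V2_MESSAGE_TYPE.get(frame[42] & 0x0F)  # version 2: message type
        return (2, name) if name else None
    return None


def example_frame(ip_da, version, type_or_control):
    """Build a mostly-zero frame carrying only the fields that are checked."""
    frame = bytearray(169)
    frame[12:14] = b"\x08\x00"
    frame[23] = 0x11
    frame[30:34] = ip_da
    frame[36:38] = PTP_EVENT_UDP_PORT.to_bytes(2, "big")
    frame[43] = version
    frame[74 if version == 1 else 42] = type_or_control
    return bytes(frame)
```

For instance, `classify_ptp_udp_ipv4(example_frame(bytes([224, 0, 1, 129]), 1, 0x01))` identifies a version 1 delay request, while a frame with any other destination UDP port is rejected.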
Example of a sync frame in the 1588 version 2 (UDP/IPv4) format:

Preamble/SFD                      55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13)               0800
IP stuff (Octets 14-22)
UDP (Octet 23)                    11
IP stuff (Octets 24-29)
IP DA (Octets 30-33)              E0000181
source IP port (Octets 34-35)
dest IP port (Octets 36-37)       013F
other stuff (Octets 38-41)
messagetype (Octet 42)            00
version PTP (Octet 43)            02

Example of a pdelay_req frame in the 1588 version 2 (UDP/IPv4) format:

Preamble/SFD                      55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13)               0800
IP stuff (Octets 14-22)
UDP (Octet 23)                    11
IP stuff (Octets 24-29)
IP DA (Octets 30-33)              E000006B
source IP port (Octets 34-35)
dest IP port (Octets 36-37)       013F
other stuff (Octets 38-41)
messagetype (Octet 42)            02
version PTP (Octet 43)            02

Example of a sync frame in the 1588 version 2 (UDP/IPv6) format:

Preamble/SFD                      55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13)               86dd
IP stuff (Octets 14-19)
UDP (Octet 20)                    11
IP stuff (Octets 21-37)
IP DA (Octets 38-53)              FF0X000000000181
source IP port (Octets 54-55)
dest IP port (Octets 56-57)       013F
other stuff (Octets 58-61)
messagetype (Octet 62)            00
other stuff (Octets 63-93)
version PTP (Octet 94)            02

Example of a pdelay_resp frame in the 1588 version 2 (UDP/IPv6) format:

Preamble/SFD                      55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13)               86dd
IP stuff (Octets 14-19)
UDP (Octet 20)                    11
IP stuff (Octets 21-37)
IP DA (Octets 38-53)              FF0200000000006B
source IP port (Octets 54-55)
dest IP port (Octets 56-57)       013F
other stuff (Octets 58-61)
messagetype (Octet 62)            03
other stuff (Octets 63-93)
version PTP (Octet 94)            02

Example of a sync frame in the 1588 version 2 (Ethernet multicast) format. For the multicast address 011B19000000, sync and delay request frames are recognized depending on the messagetype field: 00 for sync and 01 for delay request.

Preamble/SFD                      55555555555555D5
DA (Octets 0 - 5)                 011B19000000
SA (Octets 6 - 11)
Type (Octets 12-13)               88F7
messagetype (Octet 14)            00
version PTP (Octet 15)            02

Example of a pdelay_req frame in the 1588 version 2 (Ethernet multicast) format. Peer delay frames need a special multicast address so they can get through ports blocked by the spanning tree protocol. For the multicast address 0180C200000E, sync, pdelay request, and pdelay response frames are recognized depending on the messagetype field: 00 for sync, 02 for pdelay request, and 03 for pdelay response.

Preamble/SFD                      55555555555555D5
DA (Octets 0 - 5)                 0180C200000E
SA (Octets 6 - 11)
Type (Octets 12-13)               88F7
messagetype (Octet 14)            02
version PTP (Octet 15)            02

PTP frames encapsulated in UDP/IPv4 or UDP/IPv6 with a VLAN tag are also supported. VLAN frames are indicated by 8100 in the type field, and the next type has to be IPv4 or IPv6.
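The Ethernet multicast rules and the VLAN rule above can be modeled in the same way. The following Python sketch is illustrative only. The function names are hypothetical, octet 0 is the first byte of the destination address, and the low-nibble masks on the messagetype and version octets are an assumption based on the IEEE 1588 version 2 header layout.

```python
ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_VLAN = 0x8100
ETHERTYPE_IPV6 = 0x86DD
ETHERTYPE_PTP = 0x88F7

# Message types recognized for each destination address, as listed in the text.
L2_RULES = {
    bytes.fromhex("011B19000000"): {0x0: "sync", 0x1: "delay_req"},
    bytes.fromhex("0180C200000E"): {0x0: "sync", 0x2: "pdelay_req", 0x3: "pdelay_resp"},
}


def classify_ptp_ethernet(frame):
    """Return the event name for a version 2 Ethernet-multicast PTP frame, else None."""
    if len(frame) < 16:
        return None
    if int.from_bytes(frame[12:14], "big") != ETHERTYPE_PTP:
        return None
    rules = L2_RULES.get(bytes(frame[0:6]))
    if rules is None or (frame[15] & 0x0F) != 2:
        return None
    return rules.get(frame[14] & 0x0F)


def ip_header_offset(frame):
    """Return the octet where the IP header starts: 14 untagged, 18 with a VLAN tag.

    Returns None when the frame is not IPv4/IPv6, or when the type that
    follows the VLAN tag is not IPv4 or IPv6.
    """
    if len(frame) < 18:
        return None
    ethertype = int.from_bytes(frame[12:14], "big")
    if ethertype in (ETHERTYPE_IPV4, ETHERTYPE_IPV6):
        return 14
    if ethertype == ETHERTYPE_VLAN:
        inner = int.from_bytes(frame[16:18], "big")
        if inner in (ETHERTYPE_IPV4, ETHERTYPE_IPV6):
            return 18
    return None
```

The four-octet shift returned by `ip_header_offset` accounts for the difference between the tagged and untagged octet positions in the frame examples of this section.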
Example of a sync frame in the 1588 version 1 (UDP/IPv4/VLAN) format:

Preamble/SFD                      55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13)               8100
VLAN tag (Octets 14-15)
Type (Octets 16-17)               0800
IP stuff (Octets 18-26)
UDP (Octet 27)                    11
IP stuff (Octets 28-33)
IP DA (Octets 34-36)              E00001
IP DA (Octet 37)                  81 or 82 or 83 or 84
source IP port (Octets 38-39)
dest IP port (Octets 40-41)       013F
other stuff (Octets 42-46)
versionPTP (Octet 47)             01
other stuff (Octets 48-77)
control (Octet 78)                00
other stuff (Octets 79-168)

Example of a delay request frame in the 1588 version 1 (UDP/IPv4/VLAN) format:

Preamble/SFD                      55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13)               8100
VLAN tag (Octets 14-15)
Type (Octets 16-17)               0800
IP stuff (Octets 18-26)
UDP (Octet 27)                    11
IP stuff (Octets 28-33)
IP DA (Octets 34-36)              E00001
IP DA (Octet 37)                  81 or 82 or 83 or 84
source IP port (Octets 38-39)
dest IP port (Octets 40-41)       013F
other stuff (Octets 42-46)
versionPTP (Octet 47)             01
other stuff (Octets 48-77)
control (Octet 78)                01
other stuff (Octets 79-168)

Example of a sync frame in the 1588 version 2 (UDP/IPv4/VLAN) format:

Preamble/SFD                      55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13)               8100
VLAN tag (Octets 14-15)
Type (Octets 16-17)               0800
IP stuff (Octets 18-26)
UDP (Octet 27)                    11
IP stuff (Octets 28-33)
IP DA (Octets 34-37)              E0000181
source IP port (Octets 38-39)
dest IP port (Octets 40-41)       013F
other stuff (Octets 42-45)
messagetype (Octet 46)            00
version PTP (Octet 47)            02

Example of a pdelay_req frame in the 1588 version 2 (UDP/IPv4/VLAN) format:

Preamble/SFD                      55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13)               8100
VLAN tag (Octets 14-15)
Type (Octets 16-17)               0800
IP stuff (Octets 18-26)
UDP (Octet 27)                    11
IP stuff (Octets 28-33)
IP DA (Octets 34-37)              E000006B
source IP port (Octets 38-39)
dest IP port (Octets 40-41)       013F
other stuff (Octets 42-45)
messagetype (Octet 46)            02
version PTP (Octet 47)            02

Example of a sync frame in the 1588 version 2 (UDP/IPv6/VLAN) format:

Preamble/SFD                      55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13)               8100
VLAN tag (Octets 14-15)
Type (Octets 16-17)               86dd
IP stuff (Octets 18-23)
UDP (Octet 24)                    11
IP stuff (Octets 25-41)
IP DA (Octets 42-57)              FF0X000000000181
source IP port (Octets 58-59)
dest IP port (Octets 60-61)       013F
other stuff (Octets 62-65)
messagetype (Octet 66)            00
other stuff (Octets 67-97)
version PTP (Octet 98)            02

Example of a pdelay_resp frame in the 1588 version 2 (UDP/IPv6/VLAN) format:

Preamble/SFD                      55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13)               8100
VLAN tag (Octets 14-15)
Type (Octets 16-17)               86dd
IP stuff (Octets 18-23)
UDP (Octet 24)                    11
IP stuff (Octets 25-41)
IP DA (Octets 42-57)              FF0200000000006B
source IP port (Octets 58-59)
dest IP port (Octets 60-61)       013F
other stuff (Octets 62-65)
messagetype (Octet 66)            03
other stuff (Octets 67-97)
version PTP (Octet 98)            02

7.8 MDIO

An MDIO interface is provided to allow configuration of an off-chip PHY. Multiple XAUI MACs may share a single MDIO port depending on the device configuration (see datasheet). When more than one port shares the MDIO, access is coordinated in software and enabled via the MPIPE_MAC_MANAGE register.

The MAC’s MPIPE_XAUI_MDIO_CONTROL register is used to operate the MDIO interface.

7.9 Statistics

The XAUI MAC provides statistics counters for transmitted and received frames including byte counts, frame counts, specific frame sizes, and specific frame types. These registers are in the MAC starting with the MPIPE_XAUI_TRANSMITTED_OCTETS_LO register.
7.10 Filtering

Incoming packets may be checked for an exact match against up to 8 MAC addresses in the MPIPE_XAUI_EXACT_MATCH registers as well as a hashed match against the MPIPE_XAUI_RX_HASH registers (including the MPIPE_XAUI_RX_HASH_BOTTOM and MPIPE_XAUI_RX_HASH_TOP registers). They can also be checked against a set of type-match registers.

The address checking, or filter, block determines which receive frames should be sent to mPIPE. Whether a frame is sent or dropped depends on what is enabled in the receive configuration register, the contents of the specific address, type match, and hash registers, and the frame’s destination address and type/length field.

The exact match address is split between two registers, high and low. To enable or disable exact address matching, write to the exact address registers: the low register to disable, the high register to enable.

The first six bytes (48 bits) of an Ethernet frame make up the destination address. The first bit of the destination address (that is, the LSB of the first byte of the frame) is the group/individual bit. This is one for multicast addresses and zero for unicast. The all ones address is the broadcast address and a special case of multicast.

The MAC supports recognition of eight specific addresses. Each specific address requires two registers, specific address register bottom and specific address register top. Specific address register bottom stores the first four bytes of the destination address and specific address register top contains the last two bytes. The addresses stored can be specific, group, local or universal. See IEEE Standard 802-2001, Clause 9 for a detailed description of 802 addressing.
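The split of a destination address across the bottom and top registers can be illustrated with a short sketch. The helper names are hypothetical. The byte order (first byte received in the least significant byte of the bottom register) is an assumption taken from the worked example in Section 7.10.1, and the write ordering reflects the statement above that writing the low register disables matching and writing the high register enables it.

```python
def parse_mac(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into six bytes in order of transmission."""
    octets = bytes.fromhex(mac.replace(":", ""))
    if len(octets) != 6:
        raise ValueError("a MAC address must have exactly six octets")
    return octets


def pack_exact_match(mac: str):
    """Return (bottom, top) register values for one exact-match entry.

    bottom holds the first four bytes of the destination address and top
    holds the last two, each with the earliest byte in the lowest bits.
    Software would write bottom first and top last, because the write to
    the high register is what enables the match.
    """
    octets = parse_mac(mac)
    bottom = int.from_bytes(octets[0:4], "little")
    top = int.from_bytes(octets[4:6], "little")
    return bottom, top


def is_group_address(mac: str) -> bool:
    """True if the group/individual bit (LSB of the first byte) is one."""
    return bool(parse_mac(mac)[0] & 0x01)


def is_broadcast(mac: str) -> bool:
    """True for the all-ones address, a special case of multicast."""
    return parse_mac(mac) == b"\xff" * 6
```

Note that the example address 21:43:65:87:A9:CB used in Section 7.10.1 has its group/individual bit set, so it is a group (multicast) address.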
7.10.1 Type ID Checking

The contents of the four type match registers are compared to the length/type ID in bytes 13 and 14 of received frames. If there is a match, the frames are copied into the RX FIFO.

The following example illustrates the use of the address and type ID match registers for a MAC address of 21:43:65:87:A9:CB.

Preamble               55
SFD                    D5
DA (Octet 0 - LSB)     21
DA (Octet 1)           43
DA (Octet 2)           65
DA (Octet 3)           87
DA (Octet 4)           A9
DA (Octet 5 - MSB)     CB
SA (LSB)               00
SA                     00
SA                     00
SA                     00
SA                     00
SA (MSB)               00
Type ID                43
Type ID                21

The sequence above shows the beginning of an Ethernet frame. Byte order of transmission is from top to bottom as shown.

For a successful match to specific address 1, the following address matching registers must be set up:

MPIPE_XAUI_EXACT_MATCH_BOTTOM_0 = 0x87654321
MPIPE_XAUI_EXACT_MATCH_TOP_0 = 0x0000CBA9

And for a successful match to type ID, the following type ID match register must be set up:

MPIPE_XAUI_TYPE_MATCH0 = 0x80004321

7.10.2 Broadcast Address

The broadcast address of 0xFFFFFFFFFFFF is recognized if the ‘disable broadcast’ bit in the receive configuration register is zero.

To enable type matching, set bit 31 in the type match registers.

7.10.3 Hash Addressing

The hash address register is 64 bits long and takes up two locations in the memory map. The least significant bits are stored in hash register bottom and the most significant bits in hash register top.

The unicast hash enable and the multicast hash enable bits in the network configuration register enable the reception of hash matched frames. The destination address is reduced to a 6-bit index into the 64-bit hash register using the following hash function, which is an exclusive OR of every sixth bit of the destination address.
hash_index[5] = da[5] ^ da[11] ^ da[17] ^ da[23] ^ da[29] ^ da[35] ^ da[41] ^ da[47]
hash_index[4] = da[4] ^ da[10] ^ da[16] ^ da[22] ^ da[28] ^ da[34] ^ da[40] ^ da[46]
hash_index[3] = da[3] ^ da[09] ^ da[15] ^ da[21] ^ da[27] ^ da[33] ^ da[39] ^ da[45]
hash_index[2] = da[2] ^ da[08] ^ da[14] ^ da[20] ^ da[26] ^ da[32] ^ da[38] ^ da[44]
hash_index[1] = da[1] ^ da[07] ^ da[13] ^ da[19] ^ da[25] ^ da[31] ^ da[37] ^ da[43]
hash_index[0] = da[0] ^ da[06] ^ da[12] ^ da[18] ^ da[24] ^ da[30] ^ da[36] ^ da[42]

In the hash function above, da[0] represents the least significant bit of the first byte received (that is, the multicast/unicast indicator), and da[47] represents the most significant bit of the last byte received.

If the hash index points to a bit that is set in the hash register, the frame will be matched according to whether the frame is multicast or unicast.

A multicast match will be signalled if the multicast hash enable bit is set, da[0] is 1, and the hash index points to a bit set in the hash register.

A unicast match will be signalled if the unicast hash enable bit is set, da[0] is 0, and the hash index points to a bit set in the hash register.

To receive all multicast frames, the hash register should be set with all ones and the multicast hash enable bit should be set in the receive configuration register.

7.11 Special Modes

The MAC supports a number of custom modes that may be used for non-standard Ethernet or optimized packet transport applications.

7.11.1 Pass All Frames Mode

In this mode, the RX filters are not applied and all frames are passed to mPIPE. This is typically used in applications where the XAUI port is not terminating the stream (bump-on-wire, monitor applications).

7.11.2 Custom Preamble

The standard Ethernet preamble can be overridden on TX to include custom bytes and passed through to mPIPE on RX. This allows additional payload to be included with each packet without changing the overall octet count for the frame.
The NO_TX_PRE bit of the MPIPE_XAUI_MAC_INTFC_CTL register is used to allow a custom preamble on transmit. The PASS_PREAMBLE bit of the MPIPE_XAUI_RECEIVE_CONFIGURATION register allows the preamble bytes to be passed to mPIPE.

Additionally, if custom protocols require the CRC to cover the custom preamble bytes, the CRC may be configured via the PREAMBLE_CRC bit of the MPIPE_XAUI_RECEIVE_CONFIGURATION register and the PREAMBLE_CRC bit of the MPIPE_XAUI_TRANSMIT_CONFIGURATION register.

7.11.3 Short IPG

Additional byte-stuffing can be achieved by shortening the inter-packet gap (IPG) on TX and RX. On RX, the MAC can handle an IPG that has been shortened from an average of 12 down to an average of 8. On TX, the inserted IPG can be shortened using the DECREAS_IPG bit of the MPIPE_XAUI_TRANSMIT_CONFIGURATION register.

The IPG is used to compensate for link partners that have a small difference in their reference clocks. When using a shortened IPG, it is up to the system designer to ensure that sufficient IPG remains to prevent overflow of either component’s elasticity FIFO.

7.12 SERDES Control

The SERDES has programmable control for drive level, emphasis, equalization, PLL settings, and so on. These SERDES settings are configured automatically by hardware based on port mode and power-up calibration. Settings may be overridden using the SERDES register interface in MPIPE_XAUI_SERDES_CONFIG. The SERDES configuration registers are not specified in the Tile Processor and I/O Device Guide for the TILE-Gx Family of Processors (UG404).

7.13 LEDs

Each XAUI MAC has a dedicated pair of LED outputs. These are typically driven automatically based on link state and activity. However, software may override the LED behavior and state via the LED_MODE and OVD_VAL bit settings of the MPIPE_XAUI_PCS_CTL register.
CHAPTER 8 SGMII MAC INTERFACE

8.1 Introduction

The TILE-Gx™ GbE MAC and PCS provide an SGMII-based 1Gb/s Ethernet interface to mPIPE™. Configuration of the MAC is through the mPIPE MMIO space. For more information on mPIPE operations and features, refer to Chapter 6: mPIPE Architecture.

8.1.1 Features

The GbE MAC and PCS interface:
• Is compatible with IEEE Standard 802.3
• Provides an SGMII interface with CDR (inband clock recovery)
• Supports 10/100Mbps and half-duplex
• Provides a PCS layer with auto-negotiation
• Supports 802.1Qbb priority-based flow control
• Supports precision timestamping and IEEE 1588
• Supports MDIO and interrupt interfaces to off-chip PHYs
• Supports multiple loopback modes for in-system test and characterization
• Supports independent polarity reversal on TX and RX
• Provides support for 802.3az Energy Efficient Ethernet

8.2 Register Spaces

Access to the GbE MAC is via the mPIPE’s MAC interface in MMIO space. The format for the physical address is shown in Figure 8-1 and described in Table 8-1.

Figure 8-1: TRIO Interface Format

Table 8-1. TRIO_CFG_REGION_ADDR Register Bit Descriptions

Bits    Name      Type  Reset  Description
39:35   SVC_DOM   RW    0      This field of the address indexes the 32 entry service domain table.
28:26   REGION    RW    0      This field of the address selects the region (address space) to be accessed. For the config region, this field must be 0.
21      INTFC     RW    0      Interface being accessed.
                               Value  Name   Meaning
                               0      MPIPE  Access to MPIPE registers
                               1      MAC    Access to MAC registers
20:16   MAC_SEL   RW    0      Selects the MAC being accessed when bit[21] is 1.
15:0    REG       RW    0      Register address.

Registers in the GbE MAC are all 8-bytes.
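As an illustration of Table 8-1, the following sketch assembles the address bits for a MAC register access from the individual fields. The function and dictionary names are hypothetical. The field positions come from the table. Treating REG as an 8-byte-aligned offset is an assumption based on the 8-byte register size, not something the table states.

```python
# Field name -> (most significant bit, least significant bit), from Table 8-1.
ADDR_FIELDS = {
    "SVC_DOM": (39, 35),
    "REGION": (28, 26),
    "INTFC": (21, 21),
    "MAC_SEL": (20, 16),
    "REG": (15, 0),
}


def _place(name: str, value: int) -> int:
    """Range-check `value` and shift it into the position of field `name`."""
    msb, lsb = ADDR_FIELDS[name]
    width = msb - lsb + 1
    if not 0 <= value < (1 << width):
        raise ValueError(f"{name} does not fit in {width} bits")
    return value << lsb


def mac_reg_addr(mac_sel: int, reg: int, svc_dom: int = 0) -> int:
    """Address bits for a MAC register: INTFC=1 (MAC), REGION=0 (config)."""
    if reg % 8:
        # Assumption: registers are 8 bytes wide, so offsets are 8-aligned.
        raise ValueError("register offset must be a multiple of 8")
    return (_place("SVC_DOM", svc_dom) | _place("REGION", 0)
            | _place("INTFC", 1) | _place("MAC_SEL", mac_sel)
            | _place("REG", reg))
```

With these assumptions, the register at offset 0x0000 of MAC number 3 corresponds to address bits 0x230000 (INTFC set, MAC_SEL equal to 3).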
Accesses smaller than 8-bytes are not supported and will result in an MMIO error returned to the requesting Tile.

8.3 MAC and Channel Mapping

MACs are assigned MAC-Numbers in hardware. This number is used in the MAC_SEL field of the MMIO address when accessing MAC registers. System software can perform MAC “discovery” by reading the MAC_INFO register, which is always located at address 0x0000 in each MAC’s address space.

Each GbE MAC is also assigned to a specific hardware channel in mPIPE. This mapping is provided in the CHANNEL bit of the MPIPE_GBE_MAC_INFO register and is also described by the mPIPE’s MPIPE_MACn_MAP registers (MPIPE_MAC0_MAP register, for example). The channel numbers are assigned to eDMA rings by system software. See “Ring to Channel Mapping” on page 127.

8.4 Port Configuration

The GbE port is enabled via the MPIPE_MAC_ENABLE register. Basic MAC settings are configured in the MPIPE_GBE_NETWORK_CONTROL and MPIPE_GBE_NETWORK_CONFIGURATION registers. Once enabled, the link status can be monitored through the MPIPE_GBE_NETWORK_STATUS register.

A port that is disabled will automatically turn off the SERDES and operate in a low power mode. MAC registers, interrupts, and MDIO functions are still accessible when a port is disabled.

8.4.1 Lane Sharing with XAUI

XAUI ports can be reconfigured into four independent SGMII ports. This is controlled by the MPIPE_MAC_ENABLE register. When one or more lanes are operating in SGMII mode, the XAUI MAC is no longer used; the SGMII MACs control the lane instead. Refer to Chapter 7: XAUI MAC Interface for more information.

The AVAIL bits of the MPIPE_GBE_MAC_INFO and MPIPE_MAC_ENABLE registers indicate whether or not the port is able to be used in the device’s configuration.
Some SGMII control functions are still managed through the associated XAUI port’s configuration registers including:
• TX/RX polarity (via the MPIPE_XAUI_PCS_CTL register)
• SERDES controls (via the MPIPE_XAUI_SERDES_CONFIG register)

8.5 Flow Control

The GbE MAC supports standard 802.3 pause-based flow control as well as 802.1Qbb priority-based flow control. The MAC can auto-generate either pause type based on configurable high water marks in the mPIPE iPKT buffer. The reception of pause frames triggers a pause condition on one or more TX queues. These queues can be mapped into the mPIPE eDMA arbiter to block traffic from one or more rings that target the MAC.

mPIPE supports up to 16 priority queues in the iPKT buffer. Each has a programmable high water mark. The MPIPE_PR_PAUSE_THR registers contain the programmable high water marks. The lower 8 priority queues can be mapped directly to the eight 802.1Qbb PFC queues. The upper 8 priority queues are used for 802.3 pause and/or mPIPE loopback channels.

The PAUSE_MODE bit of the MPIPE_GBE_MAC_INTFC_CTL register controls the type of pause frame to be sent when priority queues become full.

8.5.1 Priority-Based Flow Control

Incoming RX packets are assigned to a priority queue based on the data extracted from the VLAN priority tag as per IEEE 802.1Qbb. The RX queue selection can be overridden by setting the PRQ_OVD or PRQ_OVD_VAL bits in the MPIPE_GBE_MAC_INTFC_CTL register.

As RX packets are written into their associated queues, a counter watches the queue to check its level. Once the high water mark is reached, the MAC can automatically dispatch priority pause frames, indicating back pressure on one or more queues. The MAC can be configured to monitor a number of queues via the TX_PRQ_ENA bit of the MPIPE_GBE_MAC_INTFC_CTL register.
The MPIPE_GBE_MAC_INTFC_TX_CTL register contains settings to override the mPIPE eDMA back pressure so that one or more queues can be manually paused or unpaused.

8.6 Interrupts

Each SGMII port has an associated GPIO pin that is designated for PHY interrupt use if needed. It is up to the system software to direct this GPIO’s input to an interrupt targeting the associated GbE software driver.

The GbE MAC produces a number of interrupt conditions as described in the MPIPE_GBE_INTERRUPT_STATUS register description.

8.7 Timestamping and IEEE 1588

The GbE MAC supports IEEE 1588 frame recognition and generation for system-wide time correlation. Operation of the 1588 frame recognition is similar to the way it is handled in the XAUI MAC. Refer to Chapter 7: XAUI MAC Interface for a description of frame recognition operations.

The GbE MAC has a dedicated timestamper in the MPIPE_GBE_1588_TIMER registers (including the MPIPE_GBE_1588_TIMER_ADJUST register) and captures timestamps into the MPIPE_GBE_PTP registers (including the MPIPE_GBE_PTP_PEER_EVENT_FRAME_RECV_SECS register). The GbE timestamp can be correlated to the mPIPE timestamper in software by performing iterative/alternating reads and using the timestamper adjustment controls.

8.8 MDIO

An MDIO interface is provided to allow configuration of an off-chip PHY. Multiple GbE MACs may share a single MDIO port depending on the device configuration (see datasheet). When more than one port shares the MDIO, access is coordinated in software and enabled via the MPIPE_MAC_MANAGE register. The MAC’s MPIPE_GBE_PHY_MAINTENANCE register is used to operate the MDIO interface.

8.9 10/100Mbps Support

The GbE MAC supports operation in 10/100Mbps modes via auto-negotiation or fixed configuration.
The lower speed modes are achieved by replicating symbols on the link as per the SGMII specification (see “Serial-GMII Specification, Revision 1.7” on page 584).

8.10 Half-Duplex Support

Although the SGMII interface from TILE-Gx (MAC) to the PHY is always full-duplex, half-duplex PHY-to-PHY links are supported. The MAC will retransmit frames when collisions occur within the 802.3-specified collision window.

8.11 Energy Efficient Ethernet Support (IEEE 802.3az)

IEEE 802.3az adds support for energy efficiency to Ethernet. These are the key features of the 802.3az enhancements:
• Allows a system’s transmit path to enter a low power mode if there is nothing to transmit
• Allows a PHY to detect whether its link partner’s transmit path is in a low power mode, therefore allowing the system’s receive path to enter low power mode
• Ensures that the link remains up during low power mode and no frames are dropped
• Enables asymmetric operation, with one direction in low power mode while the other is transmitting normally
• Provides LPI (Low Power Idle) signaling used to control entry to and exit from low power modes
• Ensures that LPI signaling can only take place if both sides have indicated support for it through auto-negotiation

8.11.1 802.3az Operation

• Low power control is done at the MII (reconciliation sub-layer).
• As an architectural convenience, the 802.3az specification assumes that transmission is deferred by asserting carrier sense; in practice it will not be done this way. The system will know when it has nothing to transmit and only enter low power mode when it is not transmitting.
• Low Power Idle (LPI) should not be requested unless the link has been up for at least one second.
• LPI is signaled on the GMII transmit path by asserting 0x01 on txd with tx_en low and tx_er high.
• A PHY on seeing LPI requested on the MII will send the sleep signal before going quiet. After going quiet it will periodically emit refresh signals.
• The sleep, quiet and refresh periods are defined in Table 78-2 of the 802.3az specification. For 1000BASE-X the sleep period is 20us, the quiet period is 2.5ms and the refresh period is 20us.
• 1000BASE-X is required to go quiet after sleep is signaled. The easiest way to handle this is to disable transmit in the SerDes.
• SGMII and XFI are not part of 802.3az and should not go quiet after sleep is signaled.
• LPI mode ends by transmitting normal idle for the wake time. A default time has been established for this condition, but it can be adjusted in software using the Link Layer Discovery Protocol (LLDP) described in Clause 79 of 802.3az.
• LPI is indicated at the receive side when sleep and refresh signaling has been detected.

8.11.2 LPI Operation in the MAC

System software must control LPI since this is a system level function. LPI operation is straightforward and firmware should be capable of responding within the required timeframes. Auto-negotiation indicates EEE capability using next page auto-negotiation.

For the transmit path:
• If the link has been up for 1 second and there is nothing being transmitted, write to the LPI bit in the network control (MPIPE_GBE_NETWORK_CONTROL) register
• Wake up by clearing the ENA_LPI bit in the XGM transmit control (MPIPE_GBE_MAC_INTFC_TX_CTL) register

For the receive path:
• Wait for an interrupt to indicate that LPI has been received
• Take any software action desired to reduce power (for example, decrease the mPIPE frequency or change from polling to interrupt mode on receive queues)
• Wait for an interrupt to indicate that regular power operation has been received and then re-enable any receive features that have been suspended

8.12 PCS Auto-Negotiation

An auto-negotiation block provides a means for the PCS to establish automatic link configuration.
It is performed at power-up or during normal operation if requested by a link partner or through the restart auto-negotiation bit in the PCS control register.

By default the Gigabit Ethernet MAC has auto-negotiation enabled in the PCS control (MPIPE_GBE_PCS_CTL) register and full and half duplex capability enabled in the PCS auto-negotiation advertisement register. The Pause capability is disabled in the advertisement (MPIPE_GBE_PCS_AUTO_NEG) register by default. If auto-negotiation is not required, then bit 12 (the AUTO_NEG bit) of the PCS control register needs to be set LOW.

When a new base or next page is received from the link partner, a PCS link partner page received interrupt is set [bit 17 (the PCS_PART_PAGE bit) of the interrupt status (MPIPE_GBE_INTERRUPT_STATUS) register]. The first time this interrupt is received, it indicates a base page received, and on subsequent reads it indicates next pages.

In order for the next page exchange to work, the next page register (0x21c) must be written within 10 ms of receiving a new page from the link partner. If the link partner is requesting next pages and the GbE MAC has none to send, then the next page register should be written with the null message (0x2001). The value 0x0000 must not be written to the next page (MPIPE_GBE_PCS_AUTO_NEG_NXT_PG) register.

The GbE MAC signals completion of auto-negotiation through the PCS auto-negotiation complete interrupt, on bit 16 of the interrupt status (MPIPE_GBE_INTERRUPT_STATUS) register. Auto-negotiation completion is also indicated by bit 5 of the PCS status (MPIPE_GBE_PCS_STS) register.

The PCS resolves the GbE MAC and link partner’s abilities and reports the result in the network status register. Pause transmit and receive resolution are reported in accordance with Table 37-4 in the IEEE 802.3 specification.
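The pause resolution rule of IEEE 802.3 Table 37-4 is compact enough to state in code. The sketch below is a software model of that table as it is commonly summarized, useful for cross-checking or simulation; it does not describe how the PCS implements the resolution, and the function name is hypothetical. The inputs are the PAUSE and ASM_DIR bits advertised by the local device and by the link partner. The table in the standard remains the authority.

```python
def resolve_pause(local_pause: bool, local_asm_dir: bool,
                  partner_pause: bool, partner_asm_dir: bool):
    """Return (tx_pause, rx_pause) for the local device.

    tx_pause: the local device may send pause frames.
    rx_pause: the local device honors received pause frames.
    """
    # Both sides advertise symmetric pause: enable both directions.
    if local_pause and partner_pause:
        return True, True
    # Asymmetric cases require ASM_DIR to be advertised on both sides.
    if local_asm_dir and partner_asm_dir:
        if partner_pause:
            # Local PAUSE=0, ASM_DIR=1; partner PAUSE=1, ASM_DIR=1.
            return True, False
        if local_pause:
            # Local PAUSE=1, ASM_DIR=1; partner PAUSE=0, ASM_DIR=1.
            return False, True
    # Every other combination disables pause in both directions.
    return False, False
```

With the default advertisement described above (Pause capability disabled), the result is pause disabled in both directions unless ASM_DIR is advertised by the local device and the link partner advertises both PAUSE and ASM_DIR.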
If full duplex capability is resolved, the duplex resolution bit is set HIGH in the network status (MPIPE_GBE_NETWORK_STATUS) register. If half duplex capability is resolved, the duplex resolution bit will be LOW. If the GbE MAC and its link partner cannot resolve a common duplex capability, the duplex resolution (FULL_DUPLEX) bit is not set and the link will be indicated as being down (bit 2 in the PCS status (MPIPE_GBE_PCS_STS) register and bit 0 in the network status (MPIPE_GBE_NETWORK_STATUS) register will both be zero) when auto-negotiation completes.

Although the GbE MAC reports the auto-negotiation resolution, it does not automatically reconfigure its duplex and pause states. Management software must therefore set the duplex bit in the network configuration (MPIPE_GBE_NETWORK_CONFIGURATION) register if it reads the duplex resolution bit as being set in the network status register.

8.12.1 PCS Collision Detect and Carrier Sense

The PCS provides both the carrier sense and collision signals for use by the MAC sub-layer when the ten bit interface (TBI) is active.

CRS (Carrier Sense) is generated by the following conditions:
• The receiver has decoded a start of packet/end of packet or receive carrier extension is active. This state is indicated internally to the PCS by crs receive.
• tx_en is active, or carrier extension is active for transmit.

The collision signal is generated whenever the PCS is requested to transmit an Ethernet frame while the crs receive signal is active. The col signal remains active for the duration of the collision.

Both crs and col are asserted regardless of the PCS’ mode (half duplex or full duplex).

8.12.2 Link Status

The PCS link status is indicated on bit 2 of the PCS status (MPIPE_GBE_PCS_STS) register, on bit 0 of the network status register, and on bit 9 of the interrupt status register.
An interrupt is generated each time the PCS link status changes (that is, whenever the link becomes good or bad). When auto-negotiation is disabled, the link status value is determined based on whether or not the PCS is in the synchronized state. When auto-negotiation is enabled, the link status value is determined by successful completion of auto-negotiation.

8.13 Statistics

The GbE MAC provides statistics counters for transmitted and received frames including byte counts, frame counts, specific frame sizes, and specific frame types. These registers are in the MAC starting with the MPIPE_GBE_OCTETS_TX_LO register.

8.14 Filtering

Incoming packets can be checked for an exact match against up to 4 MAC addresses in the MPIPE_GBE_SPECIFIC_ADDRESS registers (see MPIPE_GBE_SPECIFIC_ADDRESS_1_BOTTOM_31_0, as an example). They can also be checked against a set of type-match registers.

The address checking, or filter, block determines which receive frames should be sent to mPIPE. The decision to send or drop a frame depends on what is enabled in the receive configuration (MPIPE_XAUI_RECEIVE_CONFIGURATION) register, the contents of the specific address and type match registers, and the frame’s destination address and type/length field.

The exact match address is split between two registers, high and low. To enable or disable exact address matching, write to the exact address registers (the MPIPE_XAUI_EXACT_MATCH_BOTTOM_0 register, for example): the low register to disable, the high register to enable.

The first six bytes (48 bits) of an Ethernet frame make up the destination address. The first bit of the destination address (that is, the LSB of the first byte of the frame) is the group/individual bit. This is one for multicast addresses and zero for unicast. The all ones address is the broadcast address and a special case of multicast.
The MAC supports recognition of four specific addresses. Each specific address requires two registers, specific address register bottom and specific address register top. Specific address register bottom stores the first four bytes of the destination address and specific address register top contains the last two bytes. The addresses stored can be specific, group, local or universal. See IEEE Standard 802-2001, Clause 9 for a detailed description of 802 addressing.

8.14.1 Type ID Checking

The contents of the four type match registers are compared to the length/type ID in bytes 13 and 14 of received frames. If there is a match, the frames are copied into the RX FIFO.

The following example illustrates the use of the address and type ID match registers for a MAC address of 21:43:65:87:A9:CB.

Preamble               55
SFD                    D5
DA (Octet 0 - LSB)     21
DA (Octet 1)           43
DA (Octet 2)           65
DA (Octet 3)           87
DA (Octet 4)           A9
DA (Octet 5 - MSB)     CB
SA (LSB)               00
SA                     00
SA                     00
SA                     00
SA                     00
SA (MSB)               00
Type ID                43
Type ID                21

The sequence above shows the beginning of an Ethernet frame. Byte order of transmission is from top to bottom as shown.

For a successful match to specific address 1, the following address matching registers must be set up:

MPIPE_GBE_SPECIFIC_ADDRESS_1_BOTTOM_31_0 = 0x87654321
MPIPE_GBE_SPECIFIC_ADDRESS_1_TOP_47_32 = 0x0000CBA9

And for a successful match to type ID, the following type ID match register must be set up:

MPIPE_GBE_TYPE_ID_MATCH_1 = 0x80004321

8.14.2 Broadcast Address

The broadcast address of 0xFFFFFFFFFFFF is recognized if the ‘disable broadcast’ bit in the receive configuration (MPIPE_XAUI_RECEIVE_CONFIGURATION) register is zero.

To enable type matching, set bit 31 (the ENA bit) in the type match registers (MPIPE_XAUI_TYPE_MATCH0, MPIPE_XAUI_TYPE_MATCH1, MPIPE_XAUI_TYPE_MATCH2, and MPIPE_XAUI_TYPE_MATCH3 registers).
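The type ID match value from the example in Section 8.14.1 can be derived as in the following sketch. The helper names are hypothetical. Placing the type ID in the low 16 bits of the register, with the first type byte received in the upper of those two bytes, is inferred from the example value 0x80004321 rather than from a register description.

```python
TYPE_MATCH_ENA = 1 << 31  # bit 31 enables the type match register


def type_match_value(type_id: int, enable: bool = True) -> int:
    """Build a type match register value from a 16-bit length/type ID."""
    if not 0 <= type_id <= 0xFFFF:
        raise ValueError("type ID must fit in 16 bits")
    return (TYPE_MATCH_ENA if enable else 0) | type_id


def frame_type_id(frame: bytes) -> int:
    """Length/type field of a frame that starts at the destination address.

    These are the 13th and 14th bytes of the frame (indices 12 and 13),
    read in order of transmission.
    """
    return int.from_bytes(frame[12:14], "big")


def type_matches(reg_value: int, frame: bytes) -> bool:
    """True if the register is enabled and its type ID equals the frame's."""
    if not reg_value & TYPE_MATCH_ENA:
        return False
    return (reg_value & 0xFFFF) == frame_type_id(frame)
```

For the example frame above (type bytes 43 then 21), type_match_value(0x4321) gives 0x80004321, and the same value with bit 31 cleared never matches.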
8.14.3 Hash Addressing

The hash address register (MPIPE_GBE_HASH_BOTTOM_31_0 and MPIPE_GBE_HASH_TOP_63_32) is 64 bits long and takes up two locations in the memory map. The least significant bits are stored in hash register bottom and the most significant bits in hash register top.

The unicast hash enable (ENA_HASH_UNI) and the multicast hash enable (ENA_HASH_MULTI) bits in the network configuration (MPIPE_XAUI_RECEIVE_CONFIGURATION) register enable the reception of hash matched frames. The destination address is reduced to a 6-bit index into the 64-bit hash register using the following hash function, which is an exclusive OR of every sixth bit of the destination address.

hash_index[5] = da[5] ^ da[11] ^ da[17] ^ da[23] ^ da[29] ^ da[35] ^ da[41] ^ da[47]
hash_index[4] = da[4] ^ da[10] ^ da[16] ^ da[22] ^ da[28] ^ da[34] ^ da[40] ^ da[46]
hash_index[3] = da[3] ^ da[09] ^ da[15] ^ da[21] ^ da[27] ^ da[33] ^ da[39] ^ da[45]
hash_index[2] = da[2] ^ da[08] ^ da[14] ^ da[20] ^ da[26] ^ da[32] ^ da[38] ^ da[44]
hash_index[1] = da[1] ^ da[07] ^ da[13] ^ da[19] ^ da[25] ^ da[31] ^ da[37] ^ da[43]
hash_index[0] = da[0] ^ da[06] ^ da[12] ^ da[18] ^ da[24] ^ da[30] ^ da[36] ^ da[42]

In the hash function above, da[0] represents the least significant bit of the first byte received (that is, the multicast/unicast indicator), and da[47] represents the most significant bit of the last byte received.

If the hash index points to the IDX bit that is set in the hash (MPIPE_HFH_INIT_CTL) register, the frame will be matched, depending on whether the frame is multicast or unicast.

A multicast match will be signalled if the multicast hash enable (ENA_HASH_MULTI) bit is set, da[0] is 1, and the hash index points to a bit set in the hash register.

A unicast match will be signalled if the unicast hash enable (ENA_HASH_UNI) bit is set, da[0] is 0, and the hash index points to a bit set in the hash register.
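The hash function and match rules above can be modelled in a few lines. This is a software sketch for computing which hash register bits to set, not a description of the hardware. The function names are hypothetical, and the destination address is assumed to be supplied as six bytes in order of reception.

```python
def hash_index(mac: bytes) -> int:
    """6-bit hash index: XOR of every sixth bit of the destination address."""
    if len(mac) != 6:
        raise ValueError("destination address must be six bytes")
    # da[0] is the LSB of the first byte received, so the address is read
    # as a little-endian 48-bit integer.
    da = int.from_bytes(mac, "little")
    index = 0
    for k in range(8):
        # XOR-ing the eight 6-bit slices computes all six index bits at once:
        # index bit i = da[i] ^ da[i+6] ^ ... ^ da[i+42].
        index ^= (da >> (6 * k)) & 0x3F
    return index


def hash_registers(macs):
    """Return (bottom, top) 32-bit values with one bit set per address."""
    reg = 0
    for mac in macs:
        reg |= 1 << hash_index(mac)
    return reg & 0xFFFFFFFF, reg >> 32


def hash_match(mac: bytes, bottom: int, top: int,
               ena_uni: bool, ena_multi: bool) -> bool:
    """Apply the unicast/multicast hash match rules to one address."""
    reg = (top << 32) | bottom
    if not (reg >> hash_index(mac)) & 1:
        return False
    is_multicast = bool(mac[0] & 0x01)  # da[0]
    return ena_multi if is_multicast else ena_uni
```

Because the index has only 64 possible values, many addresses share each hash bit; a hash match is therefore a coarse filter, and software that needs exact filtering still has to check the destination address of delivered frames.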
To receive all multicast frames, the hash (MPIPE_HFH_INIT_CTL) register should be set with all ones and the multicast hash enable (ENA_HASH_MULTI) bit should be set in the receive configuration (MPIPE_XAUI_RECEIVE_CONFIGURATION) register.

CHAPTER 9 TILE-GX INTERLAKEN INTERFACE

9.1 Overview

The TILE-Gx Interlaken port provides a channelized packet interface between mPIPE and 1 or more SERDES lanes. The interface is compliant with the Interlaken Protocol Definition, revision 1.2 as well as the Interlaken Interoperability Recommendations, revision 1.4.

Note: The Interlaken interface is supported in the TILE-Gx100 and TILE-Gx64 processors only.

The Interlaken interface provides the following features:
• 1 to 10 TX/RX lanes
• 3.125Gbps or 6.25Gbps per lane (OIF CEI-6G-SR)
• Packet and burst (interleaved) modes
• Uni-directional support
• Asymmetrical support (can have different number of TX and RX lanes)
• In-band or out-of-band flow control
• Configurable burst size with optimized burst scheduler
• Programmable flow control calendar
• Link level and channel flow control
• Statistics registers, diagnostics, and test patterns
• MMIO access to registers through mPIPE’s MAC configuration interface

9.1.1 Channel Mapping

mPIPE and Interlaken both support channelized communication. While there is a one-to-one relationship between Interlaken and mPIPE channels, the numbers are different. Similarly, mPIPE priority queues are mapped to Interlaken channels by the hardware, but the number spaces are different. Depending on how many lanes are in use, the Interlaken interface will occupy 16, 20, or 24 channels and priority queues. Table 9-1 shows the mapping between Interlaken channels, mPIPE channels, and mPIPE priority queues for various configurations.

9.2 TX Interface

Packets to be sent utilize mPIPE’s eDMA rings.
Each ring is assigned to a single channel. Multiple rings may target the same channel. The hardware guarantees that a packet on a given channel won't be interrupted by another ring targeting that same channel.

Table 9-1. Mapping between Channels and Priority

Number of Lanes   Interlaken Channels   mPIPE Channels   mPIPE Priority Queues
1-4               0..15                 12..27           16..31
5-8               0..19                 8..27            12..31
9-10              0..23                 4..27            8..31

In packet mode, a packet won't be interleaved with other packets. In burst mode, packet data is interleaved from multiple channels. Packet mode is configured by the PKT bit of the MPIPE_ILK_TX_CTL register. Per-eDMA-ring bandwidth control is provided through mPIPE's egress arbiters, which are configurable through the MPIPE_EDMA_RG_INIT_DAT_THRESH and MPIPE_EDMA_CTL registers.

9.2.1 Burst Scheduler

The Interlaken TX interface attempts to optimize the bursts being sent as per the Interlaken Protocol Definition, revision 1.2. This scheduler requires the ability to look ahead in the packet in order to determine burst fragmentation. The burst scheduler will provide optimum behavior if the MAX_BLKS field of the MPIPE_EDMA_RG_INIT_DAT_THRESH register is set to at least 3 and the BURST field of the MPIPE_EDMA_RG_INIT_DAT_THRESH register is set to 1. If multiple mPIPE eDMA descriptors are used to generate packets and the final descriptor is smaller than BurstShort, it is possible that the scheduler will not provide the most efficient burst fragmentation.

9.2.2 Packet vs. Burst

The Interlaken TX interface may be configured to operate either in packet-at-a-time or burst mode. In burst mode, packets may be interleaved at a burst boundary according to the Burst/MAX/SHORT parameters of the Interlaken link. In packet mode, complete packets will be sent without any interleaving of other channels' data.
Packet mode is controlled by the PKT bit of the MPIPE_ILK_TX_CTL register. The Interlaken RX interface supports either packet or burst interleaving modes. No special settings are required since this is a property of the transmitter.

9.3 RX Interface

Packets received from the Interlaken interface are forwarded to mPIPE on the associated channel (see Table 9-1). Packets may be as small as 1 byte or as large as 16256 bytes. The RX interface can receive either burst-interleaved or full packet data.

The timestamp unit applies timestamps relative to the egress from the MAC itself. The latency from the pins of the chip to the egress of the MAC is generally consistent for a given lane configuration and data rate.

9.4 Flow Control

The TILE-Gx Interlaken interface provides a configurable flow control interface both at the link and channel level. Flow control features include:
• In-band or out-of-band support
• Programmable calendar used to map channels to flow control bits
• Link level flow control via calendar or multi-use bits
• Packet or intra-packet level response

9.4.1 Link Level TX Flow Control

The receiver provides link-level back pressure to prevent FIFO overruns due to any bandwidth mismatch between the Interlaken RX line side and the mPIPE ingress datapath. Sufficient buffering is provided to prevent packet drops as long as the connected device is compliant with the Interlaken latency requirements described in section 2.9 of the Interlaken Interoperability specification, revision 1.4.

Link level flow control can be sent to the link partner on one or more calendar bits as well as a multi-use bit. The TX link level flow control mappings are configured by MPIPE_ILK_TX_LINK_FC_CFG and MPIPE_ILK_TX_LINK_CAL_FC.
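The fixed numbering in Table 9-1 follows a simple pattern: the Interlaken channels always occupy the top of the mPIPE channel range (ending at 27) and the top of the priority queue range (ending at 31). The sketch below reproduces the table rows in software. The function names are hypothetical, and the base offsets of 28 and 32 are inferred from the table rather than taken from a register definition.

```python
def interlaken_channel_count(lanes: int) -> int:
    """Number of channels (and priority queues) occupied for a lane count,
    per Table 9-1: 16 for 1-4 lanes, 20 for 5-8 lanes, 24 for 9-10 lanes."""
    if 1 <= lanes <= 4:
        return 16
    if 5 <= lanes <= 8:
        return 20
    if 9 <= lanes <= 10:
        return 24
    raise ValueError("Interlaken supports 1 to 10 lanes")


def _check_channel(lanes: int, ilk_channel: int) -> int:
    count = interlaken_channel_count(lanes)
    if not 0 <= ilk_channel < count:
        raise ValueError("Interlaken channel out of range for this lane count")
    return count


def mpipe_channel(lanes: int, ilk_channel: int) -> int:
    """mPIPE channel for an Interlaken channel (range always ends at 27)."""
    count = _check_channel(lanes, ilk_channel)
    return 28 - count + ilk_channel  # offset inferred from Table 9-1


def mpipe_priority_queue(lanes: int, ilk_channel: int) -> int:
    """mPIPE priority queue for an Interlaken channel (range ends at 31)."""
    count = _check_channel(lanes, ilk_channel)
    return 32 - count + ilk_channel  # offset inferred from Table 9-1


if __name__ == "__main__":
    for lanes in (4, 8, 10):
        last = interlaken_channel_count(lanes) - 1
        print(lanes, "lanes:",
              "mPIPE channels", mpipe_channel(lanes, 0), "..", mpipe_channel(lanes, last),
              "priority queues", mpipe_priority_queue(lanes, 0), "..",
              mpipe_priority_queue(lanes, last))
```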
9.4.2 Channel-Based Flow Control

Flow control calendar entries are mapped to Interlaken channels via the MPIPE_ILK_TX_CAL and MPIPE_ILK_RX_CAL registers. The Interlaken channels are mapped by hardware to the uppermost priority queues in mPIPE. For example, if Interlaken is configured to use 24 channels, priority queues 8 to 31 would be mapped to channels 0 to 23 respectively. A programmable high water mark for each of the priority queues determines how much relative space each queue is allocated in the mPIPE iPKT buffer.

9.4.3 Link Level RX Flow Control

One or more calendar bits received from the link partner can be mapped to RX link level flow control (for example, to back-pressure the TILE-Gx Interlaken transmitter). Any one of the multi-use bits may also be assigned for link level back pressure. The RX link level flow control mappings are configured by MPIPE_ILK_RX_LINK_FC_CFG and MPIPE_ILK_RX_LINK_CAL_FC.

When the link level flow control bit is indicating XOFF, transmission is terminated from all channels. The amount of skid data is compliant with the Interlaken Interoperability specification, revision 1.4.

For ports operating in packet mode as per the PKT bit of the MPIPE_ILK_TX_CTL register, the flow control can be applied at a packet boundary or within the packet. This is based on the setting of the PKT_FC_FAST bit of the MPIPE_ILK_TX_CTL register.

9.4.4 Out-of-Band Flow Control

The TILE-Gx Interlaken interface supports both in-band and out-of-band flow control as per the Interlaken specification. When out-of-band flow control is enabled, dedicated chip pins carry the flow control information. Out-of-band flow control is required for uni-directional links.

9.5 Statistics

The TILE-Gx Interlaken interface provides the Interlaken-Alliance recommended statistics registers. This includes per-channel byte and packet counts as well as many per-lane error statistics. Each counter generates an interrupt on overflow via the MPIPE_ILK_INTERRUPT mechanism.
These interrupts are reflected in the MPIPE_ILK_CTR_OVFL_nn registers.

9.6 Initialization

Configuration of an Interlaken link requires cooperation between the two attached components. Parameters such as bit rate, number of lanes, type of flow control, flow control calendar, channel mapping, BurstMax, and BurstMin must be configured identically on both sides of the link. All link parameters must be established before enabling the MAC via the MPIPE_MAC_ENABLE register.

9.7 Error Handling

A misconfigured link or poor physical channel can cause various types of link errors. Many types of errors are detected by the hardware and reflected in the MPIPE_ILK_INTERRUPT_STATUS register. Received packets with bad CRC are forwarded to mPIPE with the descriptor's CE (CRC error) bit asserted.

FIFO overruns occur when the link partner fails to obey a link-level flow control event. These errors generally indicate both an oversubscription of mPIPE bandwidth and an improperly configured flow control calendar at the attached device's transmitter. FIFO overrun errors are reflected in the mPIPE descriptor's ME (MacError) bit in the cases where partial packet fragments have been forwarded.

CHAPTER 10 USB INTERFACE

10.1 Overview

The TILE-Gx Universal Serial Bus (USB) system includes two host controllers and one device endpoint controller. The system is USB 2.0-compliant. USB 2.0 is backward-compatible with 1.x devices but adds a high-speed (HS) transfer rate of 480 Mb/s, compared with the 12 Mb/s full-speed (FS) transfers of USB 1.1 and the 1.5 Mb/s low-speed (LS) transfers of USB 1.0. Refer to www.usb.org for more information about the USB 2.0 specification.
Two sets of the UTMI+ Low Pin Interface (ULPI) connect the system to external USB PHYs. Host controller [0] and endpoint controller [0] share a set of the processor's external connections via USB0. Host controller [1] has dedicated ULPI connections to an external PHY via USB1.

The USB subsystem uses the Mesh networks to communicate with the Tile cores via the memory networks, which include one reQuest Dynamic Network (QDN), one Share Dynamic Network (SDN), and two Response Dynamic Network (RDN) networks. These networks carry MMIO requests, data transfers, and interrupt requests. From the Tile Processors' viewpoint, the host controllers and the device endpoint channels are located at the same Mesh coordinates.

In order to support bootloading and debugging over USB, the USB0 endpoint controller can operate in a standalone mode without any software driver running on the Tile side. In this mode, the USB endpoint controller handles all standard USB requests from an external USB host. Users can boot, debug, and use the tile-monitor application for data movement between the external host controller and the Tile Processors.

Figure 10-1: Channel Descriptions (USB1, 12-pin: host channel CH2; USB0, 12-pin: host channel CH1 and endpoint channel CH0)

This chapter is organized as follows: 10.2 External I/O Interface describes the external PHY interface. 10.3 Mesh Interface presents the iMesh network that is used to access the memory system. Details of the host controller and the device endpoint channels are described in 10.4 Host Controller and 10.5 Device Endpoint. 10.6 Standalone Device Operation describes the standalone endpoint system design.

10.2 External I/O Interface

The USB system connects to the PHY through the chip I/O pins and uses the ULPI interface.
As compared to a typical USB 2.0 Transceiver Macrocell Interface (UTMI) PHY package size (of 48 to 56 pins), the ULPI protocol reduces the Link-to-PHY interface to eight signals, since it is optimized for use as an external PHY. A series of ULPI signals are used to manage the external PHY. Table 10-1 lists the ULPI interface signals.

Table 10-1. ULPI Interface Signals

Signal Name   Direction     Description
Clock         PHY to Link   Control and data signals are synchronous to Clock.
Data          I/O           Driven low by the Link when in IDLE state. The Link
                            starts a transfer by sending a non-zero pattern. The
                            PHY must assert DIR before using the data bus. A
                            turnaround cycle is required every time that DIR
                            toggles (changes direction from inbound to outbound).
DIR           PHY to Link   Direction of the data bus. By default, DIR is low and
                            the PHY listens for non-zero data from the Link. The
                            PHY asserts DIR to gain control of the data bus.
NXT           PHY to Link   Next data. The PHY drives NXT high to throttle the
                            data bus.
STP           Link to PHY   Stop data. The Link drives STP high to signal the end
                            of its data stream. The Link can also drive STP high
                            to request data bus access from the PHY.

A set of the 12-pin chip I/Os is shared by a host controller and the endpoint device. The connection is selected by a chip configuration pin (CONFIG_USB[0]). When CONFIG_USB[0] is deasserted, the port is used by the USB endpoint device. Software can program the connection by disabling the configuration pin and setting the STRP_PIN_DISABLE field in the USB_DEVICE_USB_PORT0_SELECT register, and by enabling the host controller by setting the HOST_ENABLE field in the same register. Another set of 12-pin chip I/Os is always used by the second host controller.

10.3 Mesh Interface

The iMesh connects to the USB system and provides:
• Access from the Tile Processors to the USB system via loads and stores in the MMIO address space.
• Access from the host controller channels to the memory system to manage data transfers.
• Interrupt notification to the Tile Processors.

10.3.1 MMIO Interface

Tile software communicates with the USB system via loads and stores in the MMIO address space using the QDN network. The MMIO space comprises the general system configuration registers, MAC registers, and the RX/TX FIFO storage. The Response Dynamic Network (RDN) carries the read data and the write acknowledge responses.

The physical address (up to 256 KB of offset within the region) in the MMIO loads and stores is formatted as follows:

Channel (bits 39:38) | Reserved (bits 37:20) | Protection (bits 19:18) | Register Offset (bits 17:0)

Table 10-2. MMIO Interface Format

Bits    Name              Description
39:38   Channel           Channel. This field is used for access to one of the
                          following:
                          • Channel 0: Endpoint
                          • Channel 1: Host 0
                          • Channel 2: Host 1
                          Several unique structure configuration registers are
                          located only in the channel 0 MMIO space, even though
                          the structures can be used only by the host channels.
                          These include the MMIO_ADDRESS_SPACE definition,
                          TLB-related registers, and Hash-for-Home configuration
                          registers. For more information about the MMIO Address
                          Space, refer to Table 11, "TILE-Gx Physical Memory
                          Space Descriptions," on page 2.
37:20   Reserved          Reserved
19:18   Protection        Specifies the register access privilege level.
17:0    Register Offset   Register Offset is defined as follows:
                          • Bits [15:0]: Register address.
                          • Bit 16: Selects the MAC configuration and status
                            registers when set. Note that all MAC register
                            accesses are 4-byte operations, and non-MAC register
                            accesses are 8-byte operations.
                          • Bit 17: Selects the Open Host Controller Interface
                            (OHCI) or Enhanced Host Controller Interface (EHCI)
                            MAC registers in the host channels. OHCI MAC
                            registers are addressed with bits [17:16] = 2'b11,
                            while EHCI MAC registers are addressed with bits
                            [17:16] = 2'b01.

10.3.2 Memory Access

The host controller systems can access the memory system via the iMesh network. The SDN network provides the reads and writes to/from the cache system, and the RDN carries the read data or write acknowledge responses.

The host EHCI controller can generate 64-bit or 32-bit I/O addresses, and the OHCI controller can generate 32-bit addresses. The I/O addresses are the data pointers in the virtual address space, and are translated to the memory address using a TLB structure. There are 16 TLB entries per host channel. In the address translation, an Address Extension Register is used if the controller is in 32-bit address mode. The top bits of the virtual address are ignored in the TLB lookup. Note that the TLB supports all standard TILE-Gx I/O-TLB attributes.

For read requests, 64 bytes of data are always returned to the system from the memory to signal a normal completion. Data is buffered to serve subsequent data reads, and it is invalidated if there are writes or any MMIO requests to the channel.

For write requests, a maximum of 4 bytes of data is always sent to the memory system without a need for any write coalescing. A write acknowledgement is required before any transaction completion notifications can be generated. Hardware always performs a memory-fence operation (that is, it waits until all previous write acknowledgements have been received) before processing writes to any different cacheline (64-byte) data block.

10.3.3 Interrupt Interface

The USB system generates interrupts that will be delivered to the Tile Processors if the interrupt bindings are enabled.
A dedicated interrupt binding can be used for each of the events listed in the USB_DEVICE/HOST_INT_VEC0_W1TC and USB_DEVICE/HOST_INT_VEC1_W1TC registers. These events include:
• MAC interrupt per channel
• TLB fault handling for host channels
• Internal interface bus error per channel
• MMIO configuration error per channel

When a MAC interrupt occurs, software must clear the interrupt registers inside the MAC before clearing the system interrupt registers. This prevents generation of spurious interrupts to the Tile Processors.

10.4 Host Controller

One EHCI and one OHCI interface are implemented in a host controller channel. These interfaces comply with the Enhanced Host Controller Interface (EHCI) Specification, Version 1.0 (http://www.intel.com/technology/usb/ehcispec.htm), and the Open Host Controller Interface (OHCI) Specification, Version 1.0a (ftp://ftp.compaq.com/pub/supportinformation/papers/hcir1_0a.pdf). Both interfaces support 32-bit addressing and 8- or 32-bit data transfers, while the EHCI also has 64-bit addressing capability.

The EHCI controller provides descriptor and data prefetching for the next USB packets while the current USB packet is still active on the USB bus. After the current USB packet transmission ends, the next packet can immediately go on the bus, because the descriptor and the data are already fetched from the system memory, thus increasing USB throughput. Up to four descriptors and up to 4KB of data (up to 8 Bulk OUT transactions of 512 bytes each) can be prefetched. Unused descriptors are discarded at the end of the (micro)frame.

The OHCI Controller supports the Keyboard/Mouse Legacy Emulation Interface.

10.5 Device Endpoint

10.5.1 Configuration

The device endpoint supports one configuration and up to four interfaces, with one alternative interface provided for each interface. In addition to the default Endpoint 0, seven extra sets of endpoint registers are provided.
These endpoints can be paired without any restriction for both IN and/or OUT directions for the same logical endpoint number.

10.5.2 MAC Design

The MAC is implemented with the Slave-Only mode design. Little user intervention is required because the software is uncomplicated, although users can use a dedicated master for data processing. The application initiates all data transfers to the memory-mapped RX/TX data storage in the channel. The device acts as a slave to all the data and CSR transfers in the Tile Processors. The device then responds to the application through a dedicated sideband interrupt.

The TxFIFO is written using MMIO transfers. After one maximum-size data packet is written and the IN endpoint status is updated, a TxFIFO controller initiates the data movement for the designated endpoint.

For OUT (outbound) transactions, the data is written to the Receive FIFO if space is available. All endpoints share a common set of receive FIFOs. An address FIFO is used to track the endpoint number and a flag that differentiates regular data from the eight bytes of SETUP data. The application reads the Endpoint Status register to determine the number of bytes to be transferred, and then initiates the data transfers.

10.5.3 MAC Interrupts

The MAC provides a single interrupt signal to indicate that at least one interrupt condition exists, as described in the sections that follow.

10.5.3.1 Device Interrupts

The Device Interrupt Register tracks system-level events. The application clears the interrupt by writing a 1'b1 to the correct bit. A Device Interrupt Mask Register can mask the designated interrupt. These events are:

SC                 The device has received a Set_Configuration command.
SI                 The device has received a Set_Interface command.
ES                 An idle state is detected on the USB for a duration of 3 milliseconds.
UR                 A reset is detected on the USB.
US                 A suspend state is detected on the USB for a duration of 3 milliseconds, following the 3-millisecond ES interrupt activity due to an idle state.
SOF                An SOF token is detected on the USB.
ENUM               Speed enumeration is complete.
RMTWKP STATE INT   A Set/Clear Feature (Remote Wakeup) is received by the core.

10.5.3.2 Endpoint Interrupts

The Endpoint Interrupt Register tracks the endpoint-level interrupts. Since all eight endpoints can be bidirectional, each endpoint has two interrupt bits (one for each direction). An Endpoint Interrupt Mask Register can mask the designated interrupt. The following events are categorized as endpoint-related events:
• Reception of a request for IN data
• Reception of an OUT data packet
• Reception of eight bytes of SETUP data packet
• An application error resulting in an internal Advanced High-Performance Bus (AHB) Error Response

10.6 Standalone Device Operation

The device endpoint system can be operated without software intervention. This special mode of operation is activated by the chip strap pin USB_CONFIG[0], and can be disabled by software by changing the USB0 port ownership to the host controller or by setting the DISABLE field in the USB_DEVICE_CFG_STANDALONE_DEVICE_CONFIG register.

The device endpoint system provides the chip boot and debugging capabilities, including handling standard USB host requests, interrupts, and data transfers between the host and Rshim. By default, no device or endpoint interrupts are forwarded to the Tile Processors during the process.

10.6.1 Interface and Endpoint Configuration

In the standalone device operation, the device is designed to have one configuration and two (boot/debug and tile-monitor) interfaces. Four endpoints in addition to the default Endpoint 0 are used. Table 10-3 summarizes the configuration.
Table 10-3. Standalone Device Configuration

Interface Number   Interface Name   Hard-Coded Endpoint Number   Endpoint Type
1                  Boot/Debug       1                            Bulk Out (a)
2                  Tile-Monitor     2                            Interrupt In
                                    3                            Bulk Out
                                    4                            Bulk In

a. Refer to the USB specification for the Bulk and Interrupt Endpoint definition.

After the connection is established, the external USB host can access all device, configuration, string, interface, endpoint, device_qualifier and other_speed_configuration descriptor information.

10.6.2 Boot/Debug Interface

There are two endpoints in the boot/debug interface. Endpoint 0 is a control endpoint that can access the Rshim registers with the Tilera-specific command described in Table 10-4.

Table 10-4. Format of Setup Data (see USB Specification Revision 2.0 Table 9-2)

Offset   Field           Size (Bytes)   Value             Description
0        bmRequestType   1              Bitmap            D7: Direction
                                                          D6..5: Type (2'b10: Vendor)
                                                          D4..0: Recipient
1        bRequest        1              Value             Rshim Command (8'b0)
2        wValue          2              Value             Rshim Register Channel (4-bit)
4        wIndex          2              Index or Offset   Rshim Register Index (16-bit)
6        wLength         2              Count             Data bytes to transfer (must be 8 bytes)

For the boot operation, the host controller first uses Endpoint 0 to read the specific Rshim register (RSH_PG_CTL) to calculate the maximum amount of data that can be transferred. The actual data is moved using Endpoint 1, a Bulk OUT endpoint, from the host controller to the chip. The boot data is required to be a multiple of 8 bytes in each data transfer, and the target Rshim register is always RSH_PG_DATA.

The debug operation also uses Endpoint 0 to read and write Rshim registers. The USB logic guarantees that any read or write requests through Endpoint 0 are delivered to the Rshim with higher priority than the requests generated by Endpoint 1.
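The setup data layout in Table 10-4 can be illustrated by packing the eight bytes on the host side. This is a minimal sketch with a hypothetical function name. It assumes the standard USB 2.0 conventions that multi-byte setup fields are little-endian and that D7 = 1 means a device-to-host (read) transfer; the Recipient value is not given in the table, so it defaults to 0 here purely as a placeholder.

```python
import struct

VENDOR_TYPE = 0b10        # D6..5 = 2'b10 (Vendor), from Table 10-4
RSHIM_COMMAND = 0x00      # bRequest = 8'b0, from Table 10-4
RSHIM_DATA_LENGTH = 8     # wLength must be 8 bytes, from Table 10-4


def rshim_setup_packet(read: bool, channel: int, reg_index: int,
                       recipient: int = 0) -> bytes:
    """Build the 8-byte SETUP data for an Rshim register access.

    recipient is an assumption (placeholder default 0); the source table
    only says that D4..0 carries the recipient.
    """
    if not 0 <= channel <= 0xF:
        raise ValueError("Rshim register channel is a 4-bit value")
    if not 0 <= reg_index <= 0xFFFF:
        raise ValueError("Rshim register index is a 16-bit value")
    if not 0 <= recipient <= 0x1F:
        raise ValueError("recipient is a 5-bit value")
    direction = 0x80 if read else 0x00          # D7: 1 = device-to-host
    bm_request_type = direction | (VENDOR_TYPE << 5) | recipient
    # <BBHHH: bmRequestType, bRequest, wValue, wIndex, wLength (little-endian)
    return struct.pack("<BBHHH", bm_request_type, RSHIM_COMMAND,
                       channel, reg_index, RSHIM_DATA_LENGTH)


if __name__ == "__main__":
    # Hypothetical channel and index values, chosen only for illustration.
    print(rshim_setup_packet(read=True, channel=1, reg_index=0x20).hex())
```

The resulting bytes would be handed to whatever USB host library issues the control transfer; that part is outside the scope of this sketch.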
10.6.3 Tile-Monitor Interface

There are two data buffers implemented in Rshim for the Tile-Monitor data movement. The host controller reads from the Rshim RX data buffer and writes to the TX data buffer to communicate with the Tiles. The flow control between the host controller and Rshim is managed by the following registers:
• RSH_TM_HOST_TO_TILE_STS should be read from the host controller through the Endpoint 1 to calculate how much data it can send to Rshim.
• RSH_TM_HOST_TO_TILE_DATA is the target register for the USB device to deliver the data.
• RSH_TM_TILE_TO_HOST_STS should be read from the host controller through the Endpoint 3 to calculate how much data it can receive from Rshim.
• RSH_TM_TILE_TO_HOST_DATA is the source register for the USB device to request data.

For data moving from the Tiles to the host controller, Endpoint 2 is the Interrupt IN endpoint used to poll RSH_TM_TILE_TO_HOST_STS periodically. If the read succeeds and there is data to be transferred, the endpoint returns the number of bytes to the host; otherwise the endpoint returns NACK. The USB device fetches data (at most 512 bytes) from Rshim the next time the host controller makes an Endpoint 4 Bulk IN request. Rshim uses NACK to indicate the data FIFO empty condition.

For data moving from the host controller to the Tiles, the host controller uses the control Endpoint 0 to access RSH_TM_TILE_TO_HOST_STS to calculate the amount of data to be sent (which must be a multiple of 8 bytes in each data transfer). The host controller then uses Endpoint 3 Bulk OUT requests to deliver data. The USB logic guarantees that any read or write requests through Endpoint 0 or 2 are delivered to Rshim with higher priority than the requests generated by Endpoints 3 and 4.
CHAPTER 11 (MiCA) COMMON ACCELERATOR INTERFACE

11.1 Introduction

This chapter describes the architecture of the TILE-Gx™ Multicore iMesh Coprocessing Accelerator (MiCA™). The MiCA provides a common front-end (both SW and HW) to various I/O off-load or acceleration functions, for example Crypto or Compression. The MiCA performs operations as specified by an opcode on data in memory. The exact set of operations that it performs is dependent on the specific MiCA implementation. For instance, the TILE-Gx Crypto Engine is built around the MiCA architecture with the set of operations it can perform specified as cryptographic operations.

The MiCA uses a memory-mapped I/O (MMIO) interface for the array of tiles. Because it uses the MMIO interface, access to the control registers can be controlled through the use of in-tile TLBs. The memory mapped interface enables tiles to instruct the MiCA to perform operations from user processes in a protected manner. Memory accesses performed by the MiCA are validated and translated by an I/O-TLB, which is located in the MiCA. This allows completely protected access for operations that user code instructs the MiCA to execute.

The MiCA connects to TILE-Gx's memory networks and processes requests, which come in via its memory mapped I/O interface. A request consists of a Source Data Descriptor, a Destination Data Descriptor, a Source Data Length, an Operation to perform, and an optional Pointer to Extra Data (ED). Many requests can be in flight at one time as the MiCA supports a large number of independent Contexts, each containing its own state.
An operation is initiated by writing the request parameters to a Context's User registers (1). As the operation progresses, the MiCA verifies that the memory that is accessed by the operation can be accessed legally. If the operation instructs the MiCA to access data that is not mapped by the Context's I/O-TLB, a TLB Miss interrupt is sent to the Context's bound tile. It is the responsibility of the tile I/O-TLB miss handler to fill the I/O-TLB. At the completion of the operation the MiCA sends a completion interrupt to the Context's bound tile.

Because the MiCA is multi-contexted, multiple operations can be serviced at the same time. Each MiCA implementation has some number of processing engines (for example, crypto, compression, etc.) and a Scheduler, which assigns requesting Contexts to those engines. All Contexts are independent from each other. Under typical operation, a Context is allocated to a particular tile and that tile instructs operation of the Context. A Context is not multi-threaded. If a tile needs overlapped access to a MiCA accessible accelerator, multiple Contexts can be utilized by a single tile.

1. This is the Context that is used at the same level as Engine and Scheduler. For a definition of Context refer to "Glossary, Conventions and Standards" on page 585.

This chapter is organized with the MiCA common features defined in the main text, and implementation specific details in other chapters (for Crypto and Compression implementations). For additional information, see Chapter 12: Cryptographic Accelerator Interface or Chapter 13: Compression Accelerator Interface.

11.2 Overview and Major Functional Blocks

Figure 11-1 shows a high level block diagram for the MiCA. Descriptions of each of the sub-blocks are provided in the following section.
Figure 11-1: MiCA Block Diagram (blocks shown: Mesh Interface with QDN, SDN, and RDN network interfaces; MMIO Registers and Context State, including Context Registers, Global Registers, Context Specific State, and TLB; Engine Scheduler; PA to Route Header Generation; Egress DMA, from memory to MiCA; Ingress DMA, from MiCA to memory; Engine Front End; Function-Specific Engines, for example Crypto or Compression)

11.2.1 Major Blocks

The next sections describe the MiCA using Figure 11-1 as a reference point.

11.2.1.1 Mesh Interface

The MiCA interfaces to the Tiles via the mesh interface. The connections are:
• QDN In: Receives Tile MMIO accesses to MiCA.
• RDN Out: MMIO read data and MMIO write acks from MiCA to tiles. The MiCA also sends IPI interrupts to Tiles.
• SDN Out: MiCA to memory read requests, memory write requests, and write data.
• RDN In: Memory read data and write acks to MiCA.

11.2.1.2 MMIO Registers and Context State

Tile access and control of the operation of the MiCA is provided via a set of memory mapped registers. Tiles access the registers via MMIO writes and reads to set up operations and check status.

Address Space

The MiCA physical address is partitioned as described in Table 11-1 and illustrated in Figure 11-2 (note that the MiCA is selected by its x/y mesh coordinate).
25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 Register Partition Usage of the Bits by Partition Select 9 8 7 6 5 4 3 2 1 0 Byte Offset 10 Context Number Register Number Must be 0.1 Context User Space 11 Context Number Register Number Must be 0.1 Context System Space 00 01 Register Number Engine Number Register Number Must be 0.1 Global Space Must be 0.1 Engine Access Space Figure 11-2: MiCA Physical Address1 Note: The specific Hypertext links provided in the text that follows are to the compression instance of MiCA registers. There is a corresponding register in the crypto instance of the MiCA registers. Context Registers The specific number of Contexts supported is defined by each MiCA implementation. Each Context has two distinct sets of registers for which different protection levels can be assigned, typically for User and System space. Source Descriptor The Source Descriptor is defined in “Source Data” on page 181. 1. The Byte Offset is zero, because only 8-byte accesses are allowed. Tile Processor and I/O Device Guide for the TILE-Gx Family of Processors Tilera Confidential — Subject to Change Without Notice 177 Chapter 11 Common Accelerator Interface (MiCA) Table 11-1. MiCA Physical Address Bits Name Description 25:24 Register Partition Note that the partitions are spaced 16MB apart. This provides access control; if the entire user partition is mapped into a 16MB page, that page will not also include any system partition registers. 10 Context User Space 11 Context System Space 00 Global Space 01 Engine Access Space Context User Space and Context System Space 23:14 Context Number A 10-bit block allows for up-to 1k Contexts (the actual number is configurable per design instance of a given MiCA instantiation; a typical usage is to select that number based on the number of Tiles in a given chip). 
Each Context is aligned to a 16kB address boundary; this allows each Context to be protected via Tile memory management on a 16kB page, or four Contexts to be grouped in a 64kB page.
13:3  Register Number  These 11 bits allow for 2k registers, but many fewer are defined. Note: Writes to unused addresses are dropped; reads from unused addresses return 0x0.
2:0  Byte Offset  All registers are 8 bytes and must be written via 8-byte store instructions and read via 8-byte load instructions.

Global Space
23:3  Register Number  These 21 bits allow for 2M registers, but many fewer are defined. Note: Writes to unused addresses are dropped; reads from unused addresses return 0x0.
2:0  Byte Offset  All registers are 8 bytes and must be written via 8-byte store instructions and read via 8-byte load instructions.

Engine Access Space
23:18  Engine Number  These six bits allow for up to 64 Engines. The actual number is configurable per design instance of a given MiCA instantiation.
17:3  Register Number  These 15 bits allow for 32k registers, but many fewer are defined. Note: Writes to unused addresses are dropped; reads from unused addresses return 0x0.
2:0  Byte Offset  All registers are 8 bytes and must be written via 8-byte store instructions and read via 8-byte load instructions.

Destination Descriptor

The Destination Descriptor is defined in “Destination Data” on page 182.

Extra Data Pointer

The Extra Data Pointer is optional, depending on the operation being performed. It is defined in MICA_COMP_CTX_USER_EXTRA_DATA_PTR.

Operation Length

The Operation Length is defined in “Destination Data” on page 182. See also MICA_COMP_CTX_USER_OPCODE.

OPCODE

OPCODE consists of the following fields. Refer to MICA_COMP_CTX_USER_OPCODE.
• Engine Type for the requested operation.
The ENGINE_TYPE field defines which type of Engine in the MiCA should perform the operation. Types 0 and 1 are common for all accelerators; Types 2 through 7 are defined uniquely for each accelerator.
• Source Mode
  • Single Buffer Descriptor
  • List of eDMA Descriptors
• Destination Mode
  • Single Buffer Descriptor
  • List of Buffer Descriptors. If this mode is selected, the number of Buffer Descriptors in the list is also specified.
  • Overwrite Source Data
• Extra Data Size (number of 8-byte words)
• Destination Size. This field is dependent on the engine type, and specifies the destination size as a function of source size.
• The MICA_COMP_CTX_USER_OPCODE can also contain some fields specific to a given MiCA implementation, based on its capabilities. If so, they are described with the engine-specific information (for example in Chapter 12: Cryptographic Accelerator Interface).
Note: Operation length and opcode are packed into one register; this allows a complete operation to be specified in four MMIO writes (versus five).
• In_Use – reads the value from the COMP_PENDING bit of the MICA_COMP_CTX_USER_CONTEXT_STATUS register.
  • Allows User to poll for completion instead of receiving an interrupt.
• User Status
  • Number of bytes written to destination.
  • Error status bits.

Context System Registers
• Completion Interrupt Binding register (MICA_COMP_CTX_SYS_COMP_INT).
• TLB Miss Interrupt Binding register (MICA_COMP_CTX_SYS_TLB_MISS_INT).
• Interrupt Mask registers:
  • Interrupt Mask register (MICA_COMP_CTX_SYS_INT_MASK)
  • Interrupt Mask Set register (MICA_COMP_CTX_SYS_INT_MASK_SET)
  • Interrupt Mask Reset register (MICA_COMP_CTX_SYS_INT_MASK_RESET).
• Miss Virtual Address (VA) register (MICA_COMP_CTX_SYS_MISS_VA).
• TLB Table (16 entries per Context) register (MICA_COMP_CTX_SYS_TLB_TABLE).
• Probe VA register (MICA_COMP_CTX_SYS_PROBE_VA).
• TLB Probe Status register (MICA_COMP_CTX_SYS_PROBE_STATUS).
• Control register (MICA_COMP_CTX_SYS_CONTROL).
• System Status register (MICA_COMP_CTX_USER_CONTEXT_STATUS).

Figure 11-3 shows the set of states that a Context can be in, as well as the transitions between the states.

[Figure 11-3: Context States. States shown: IDLE, RUN WAIT, RUN, RESET WAIT, PAUSE IDLE, PAUSE RW.]

Legend

State  Description
IDLE  No operation in progress. Context User registers may be written to set up a new operation.
RUN WAIT  Context User registers have been written; the operation is waiting to be assigned to an engine.
RUN  Operation has been assigned to an engine and is running. This is the only state in which the Context will access the TLB and initiate memory accesses.
RESET WAIT  Control register RESET bit was written to 1; the engine is waiting for in-flight memory accesses to complete.
PAUSE IDLE  Control register PAUSE bit was written to 1; no operation has been requested.
PAUSE RW  Control register PAUSE bit was written to 1; there is also an operation requested.

Global Registers

Global registers refer to those registers in the Global Register Space, as illustrated in Figure 11-1 on page 176. These registers are common to all I/O devices and are described in “Device Discovery” on page 3.

11.2.1.3 Operand Data Specification

Each MiCA operation reads Source Data, operates on it, and then writes the result out as Destination Data. It can optionally read and/or write Extra Data. The next sections define the details of the operands.

Extra Data

The MiCA architecture allows users to specify Extra Data that will be needed by an operation. Some operations done in MiCA might not need any Extra Data.
For example, encryption keys are specified in Extra Data; a memory-to-memory copy does not use any Extra Data. When used, the Extra Data is specified by its Virtual Address in the Extra Data Ptr (MICA_COMP_CTX_USER_EXTRA_DATA_PTR) register and a length, which is contained in the MICA_COMP_CTX_USER_OPCODE register. The length is defined as a number of 8-byte words; the data must be padded with zeroes if its actual size is not a multiple of 8 bytes.

Source Data

Source data can be specified as either a Multicore Programmable Intelligent Packet Engine (mPIPE™) Buffer Descriptor (refer to Section 6.2.2 Buffers on page 94), or a list of mPIPE eDMA Descriptors (refer to the “eDMA Descriptor Format” section), as determined by the SRC_MODE field of the MICA_COMP_CTX_USER_OPCODE register.
• 0 = Single Buffer Descriptor. The Source Data register (MICA_COMP_CTX_USER_SRC_DATA) contains the Buffer Descriptor bitfield (BUFFER_DESC), which can be either chained or unchained. Note that the MICA_COMP_CTX_USER_OPCODE register SIZE field specifies the total source data length.
• 1 = List of eDMA Descriptors. The Source Data register contains a VA pointer (VA) to a list of mPIPE eDMA Descriptors. The maximum size of the list is four eDMA Descriptors, and the pointer must be cacheline-aligned.

eDMA Descriptor Format

The eDMA Descriptor format is the same as the one mPIPE uses, although some fields used by mPIPE do not apply to the MiCA and are ignored. See “eDMA Packet Descriptors” on page 124 for more information. MiCA uses only the Size, Bound (Boundary bit), and Buffer Descriptor fields. Refer to Figure 6-3 on page 95 for more information on these fields.

The MiCA Source Mode can be set either to a single Buffer Descriptor or to a list of eDMA Descriptors. Figure 11-4 (“Using a List of eDMA Descriptors as a MiCA Source Mode” on page 182) shows the case where the Source Mode is a list of eDMA descriptors.
Each eDMA descriptor in the list contains two sub-fields: a Size field and a Buffer Descriptor field. There can be up to four eDMA descriptors in the list. In the example depicted in Figure 11-4, the first eDMA Descriptor points to a buffer chain (refer to 6.2.2.2 Buffer Chaining). The other three eDMA Descriptors can point to different buffer chains.

The Source Descriptor (Src Desc) should be assigned the VA of the list (array) of up to four eDMA Descriptors. MiCA can process one, two, three, or four eDMA Descriptors in one operation (note that processing a single eDMA Descriptor is not illegal, but can be performed more efficiently using the Single Buffer Descriptor source mode). If the list has four eDMA Descriptors, the Boundary bit is implied as set on the fourth descriptor. Having a Size of 0 in any of the eDMA Descriptors results in an error.

[Figure 11-4: Using a List of eDMA Descriptors as a MiCA Source Mode (see footnote 1). The figure shows the MiCA Context User Registers (Src Desc, Dest Desc); Src Desc points to a list of four eDMA Descriptors in memory, each holding a Size (and other fields) and a Buffer Descriptor; the Buffer Descriptor of the first eDMA Descriptor points to a Buffer Chain, Buffer-0 through Buffer-n.]

Fields of the Buffer Descriptor used by the mPIPE that do not apply to the MiCA and are ignored are:
• Gen – Generation number.
• CSUM – Checksum generation enabled.
• NS – NoSend.
• CSUM_START – Start byte of checksum.
• CSUM_DEST – Destination of checksum.
• Notif – Notification interrupt.
• StackIDX – 52:48. MiCA does not manage stacks of buffers as mPIPE does.
• Format of HWB. MiCA does not release buffers as mPIPE does.

Unused fields must be 0. For more information about the Buffer Descriptor, refer to “iDMA Packet Descriptors” on page 100.
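The rules stated above for the List of eDMA Descriptors source mode (at most four descriptors, a cacheline-aligned list pointer, no Size of 0) can be checked in software before an operation is submitted. The sketch below is only an illustration: the structure is a hypothetical unpacked view and not the hardware descriptor layout, the cacheline size is a caller-supplied assumption, and treating the Bound bit as the end-of-list marker is an inference from the statement that it is implied on the fourth descriptor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MICA_MAX_EDMA_DESCS 4   /* maximum list length stated in the text */

/* Hypothetical unpacked view of the eDMA Descriptor fields MiCA uses in
 * addition to the Buffer Descriptor. This is NOT the hardware bit layout. */
struct edma_desc_view {
    uint32_t size;   /* Size field                */
    bool     bound;  /* Bound (Boundary) bit      */
};

/* Returns how many descriptors would be consumed (1..4), or -1 if the list
 * breaks one of the stated rules. cacheline_bytes is an assumption supplied
 * by the caller; it is not defined in this section. */
static inline int mica_check_src_list(uint64_t list_va,
                                      const struct edma_desc_view *d,
                                      size_t n, uint64_t cacheline_bytes)
{
    assert(d != NULL);
    if (n == 0 || n > MICA_MAX_EDMA_DESCS)
        return -1;
    if (cacheline_bytes == 0 || list_va % cacheline_bytes != 0)
        return -1;                    /* pointer must be cacheline-aligned */
    for (size_t i = 0; i < n; i++) {
        if (d[i].size == 0)
            return -1;                /* a Size of 0 results in an error   */
        /* Assumed: the Bound bit ends the list; implied on the fourth.    */
        if (d[i].bound || i == MICA_MAX_EDMA_DESCS - 1)
            return (int)(i + 1);
    }
    return -1;                        /* fewer than four entries, none bound */
}
```

A check like this is cheap compared with diagnosing an error reported by the hardware after the operation has been started.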
Destination Data

Destination data can be specified as either an mPIPE Buffer Descriptor (refer to Section 6.2.2 Buffers on page 94) or a list of mPIPE Buffer Descriptors, as determined by the DST_MODE field of the MICA_COMP_CTX_USER_OPCODE register.
• 0 = Single Buffer Descriptor. The Destination Data register (MICA_COMP_CTX_USER_DEST_DATA) contains the Buffer Descriptor field (BUFFER_DESC). The descriptor can be marked either unchained or chained, but it must refer to a single buffer large enough to hold all the destination data; specifying a chain of buffers is illegal.
• 1 = Overwrite Source Buffers. The MICA_COMP_CTX_USER_DEST_DATA register is not used; the destination data is written into the Source Buffers.
• 2 = List of Buffer Descriptors. The MICA_COMP_CTX_USER_DEST_DATA register contains a VA pointer to a list of Buffer Descriptors. The number of descriptors in the list is specified in the DEST_MODE bitfield of the MICA_COMP_CTX_USER_OPCODE register, and has a maximum of 32. The pointer must be cacheline-aligned. MiCA treats each buffer descriptor in the list as a single buffer only. If the Chain field is Chained, the Size field specifies the size of the buffer and the next descriptor is ignored. If the field is Unchained, then the Size field is ignored.

1. (Footnote to Figure 11-4.) See also 6.5.2 eDMA Packet Descriptors.

Note that the Operation Length register (MICA_COMP_CTX_USER_OPCODE) specifies the total source data length. The destination data length is determined by the operation, and can be equal to, greater than, or less than the source data length. If it is greater than the source data length, the DEST_SIZE field of the MICA_COMP_CTX_USER_OPCODE register indicates how much additional space is available.

Figure 11-5 shows the case where the Destination Mode (Dest Mode) is set to a list of buffer descriptors.
Dest Desc should be assigned the VA of a list (array) of up to 32 Buffer Descriptors, as specified by the number of destination buffer descriptors (NUM_DEST_BD). Each Buffer Descriptor points to a corresponding buffer in the chain.

[Figure 11-5: Using a List of Buffer Descriptors as a MiCA Destination Mode. The figure shows the MiCA Context User Registers (Src Desc, Dest Desc); Dest Desc points to an array of Buffer Descriptors 0 through 31 in memory, each with C=1, pointing to Buffer-0 through Buffer-31.]

The array can contain up to 32 buffers. The status register reports the number of destination data bytes written, from which the number of used buffers can be determined. The chaining field in the Buffer Descriptors must be set to 1. Otherwise, the MiCA output will be written to the corresponding buffer irrespective of the size specified in the Buffer Descriptor.

11.2.1.4 TLB (Translation Lookaside Buffer)

The TLB is used to store VA-to-PA (Virtual Address-to-Physical Address) translations. It is partitioned per Context, with each Context having 16 entries. Tiles write to and read from the TLB and can initiate probes to it. The MiCA performs lookups in the TLB.
• MMIO accesses – Tiles read and write entries in the TLB.
• Translation lookups are done by the MiCA at operation startup and when data buffers cross page boundaries.

11.2.1.5 Engine Scheduler

Scheduling consists of assigning Contexts that have operations to perform to the hardware resources that perform them (for example, the Function Specific Engines). This function is necessary because there are many more Contexts than Engines. Once an Engine is assigned to a Context, it runs the operation to completion; an Engine is not time-shared within an operation.
11.2.1.6 Function Specific Engines

There are multiple Processing Engines, each capable of performing a given algorithm (for example, for Crypto: AES, DES, MD-5, SHA, etc.). The exact list and number of instances of each type is specific to each MiCA implementation, and the number of Engines is typically much lower than the number of Contexts.

11.2.1.7 DMA Channels

The DMA Channels move data between memory and the Function Specific Engines. Egress DMA is for reading data from memory (packet data and related operating parameters), and Ingress DMA is for writing packet data to memory (note that this convention is the same as for the mPIPE, where Ingress packets travel from the external interface into memory, and Egress packets travel from memory to the external interface). Each Engine has dedicated Egress and Ingress DMA channels assigned to it, so that no Engine is blocked by any other.

11.2.1.8 PA to Header Generation

This block takes the physical address and page attributes from a DMA channel and converts them into a route header to pass to the Shared Dynamic Network (SDN) Out mesh interface.

11.3 Operation Flow

This section describes the high-level flow of an operation through the MiCA, followed by some specific operation details.

11.3.1 General Flow

1. Tile software or hardware puts source data in memory. For example, the data could be a packet received by the mPIPE.
2. Tile software allocates memory for destination data.
3. Tile software puts extra data, if needed, in memory.
4. Tile software writes parameters describing the operation into its allocated Context Registers.
5. The Context requests use of an Engine from the Scheduler.
6. When an Engine is available, the Scheduler assigns a waiting Context to it.
7. The Engine reads operation parameters from the Context’s registers.
8. The Engine accesses data from memory, in the following order:
a. If the Source Mode is a List of eDMA Descriptors, read the list.
b. If the Destination Mode is a List of Buffer Descriptors, read the list.
c. If Extra Data is required, read the Extra Data.
d. Read Source Data. For more information refer to 6.2.2 Buffers on page 94.
9. Source data is processed (for example, encrypted/decrypted, compressed/decompressed, etc.), and output is written to Destination.
Note: The Engine performs TLB lookups to translate VAs, as needed. A TLB miss generates an IPI interrupt to the Tile (if not masked). Only one TLB Miss will occur at a time (subsequent lookups do not happen under a miss). The Engine overlaps the next TLB lookup with the current data access. The Engine continues reading source data and writing destination data until the operation is complete.
10. The MiCA sets the COMP_PENDING bitfield of the MICA_COMP_CTX_USER_CONTEXT_STATUS register and sends an IPI interrupt (if not masked). The interrupt also acts as a memory fence; it is not sent until the destination data is visible in memory.

Note that Contexts request specific types of operations by specifying the Engine Type (ENGINE_TYPE field) in the MICA_COMP_CTX_USER_OPCODE register, and the Scheduler assigns the specific engine (steps 5 and 6 above). This allows a MiCA implementation to include multiple copies of a given engine for higher processing throughput. Each engine also has a number, by which it can be accessed for system operations such as performance monitoring and reset. In general the specific engine number does not correlate to Engine Type. The Engine Number is not significant to the Context.

11.3.2 Tile Interrupts

Interrupts can be sent for the following reasons:
• TLB Misses
• Normal Completion

Each of the interrupt types has its own Bound Tile, as specified in the Context System registers. Each also has a Pending bit in the Context System MICA_COMP_CTX_USER_CONTEXT_STATUS register, and a MASK bit in the Context System Interrupt Mask register (MICA_COMP_CTX_SYS_INT_MASK).
This combination allows software to poll instead of taking interrupts, and also to defer interrupts temporarily. The operation is:
1. The appropriate Pending bit is set during the course of the operation, for example if a TLB Miss occurs or the operation completes.
a. When the Pending bit (either COMP_PENDING or TLB_MISS_PENDING) is 1 and the associated Interrupt Mask bit is 0, an IPI interrupt is sent to the Bound Tile. Note that the IPI is sent via the Response Dynamic Network (RDN), which is also used for MMIO read responses and MMIO write acks. This means that an MMIO response is ordered behind any IPI that was sent earlier.
b. If the MASK bit is 1, the IPI interrupt is not sent.
c. If the MASK bit is written from 1 to 0, the pending interrupt is sent.
The MASK bits can be written directly at the Interrupt Mask register (MICA_COMP_CTX_SYS_INT_MASK) address, or set/reset by writing a 1 to the Interrupt Mask Set (MICA_COMP_CTX_SYS_INT_MASK_SET)/Interrupt Mask Reset (MICA_COMP_CTX_SYS_INT_MASK_RESET) register address, respectively.
The interrupt types are largely independent, with the following dependencies:
• Normal Completion cannot happen if there is a TLB Miss outstanding. The operation cannot complete without the required translation.
Note: Strictly speaking, if the TLB Miss is prefetching the next translation, it is not required for the operation. However, the TLB Miss must still be acknowledged in order for the operation to complete.
• There will only be one TLB Miss outstanding at a time (there is only one MICA_COMP_CTX_SYS_MISS_VA register per Context).
• After a Completion interrupt occurs there will be no subsequent TLB Miss interrupts.
2. The Tile must dismiss the interrupt.
This can be done by writing a 1 to the interrupt’s Pending bit in the Context System Register. Alternatively, all of the Pending bits are cleared by writing the MICA_COMP_CTX_USER_OPCODE register to start the next operation (this implicitly acknowledges the operation completion, and saves the MMIO write that would otherwise be needed to dismiss the Completion interrupt). The TLB Miss interrupt Pending bit is also cleared by filling a TLB entry when the TLB Miss Ack bit (TLB_MISS_ACK) value is 1 in the write of the entry’s TLB Attributes (this also implicitly clears the TLB Miss interrupt bitfield (MICA_COMP_CTX_SYS_TLB_MISS_INT); refer to “TLB Miss”).

11.3.3 Specific Use Examples

The next sections give some examples of specific uses. This list is not exhaustive; other uses not covered here are also possible.

11.3.3.1 General Use

Contexts are assigned to Tile User level processes; the mechanism for doing that is beyond the scope of this discussion. It is assumed that as part of the setup process the Context User Registers for the assigned Context are mapped into the User’s virtual memory space and the Context System Completion Interrupt Binding register (MICA_COMP_CTX_SYS_COMP_INT) is set to interrupt the User process. The User can then directly initiate operations and receive Completion Interrupts. The TLB Miss Interrupt Binding is set to interrupt System-level software.

An operation is initialized by writing the operation parameters into the Context User registers. The write of the MICA_COMP_CTX_USER_OPCODE register triggers the operation to start.

Note: Once the MICA_COMP_CTX_USER_OPCODE register is written, the operation parameters should not be written again until the operation completes, as determined by the Context sending a completion interrupt.

The parameters are the Source and Destination Descriptors, a virtual address pointer to extra data (if needed, for example Crypto parameters such as keys), and the opcode and length of the operation.
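The four-write sequence described above can be sketched as follows. This is an illustration under stated assumptions rather than driver code: the structure layout and the function name are hypothetical (real register offsets come from the MiCA register map), and on real hardware the pointer would refer to the mapped Context User registers. Only the ordering is taken from the text: the parameters are written first and MICA_COMP_CTX_USER_OPCODE last, because that write starts the operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the four Context User registers that describe an
 * operation. The order and spacing here are assumptions for illustration. */
struct mica_ctx_user_regs {
    volatile uint64_t src_data;        /* MICA_COMP_CTX_USER_SRC_DATA       */
    volatile uint64_t dest_data;       /* MICA_COMP_CTX_USER_DEST_DATA      */
    volatile uint64_t extra_data_ptr;  /* MICA_COMP_CTX_USER_EXTRA_DATA_PTR */
    volatile uint64_t opcode;          /* MICA_COMP_CTX_USER_OPCODE         */
};

/* Program one operation with four 8-byte stores. The OPCODE register, which
 * also carries the operation length, is written last because that write
 * triggers the operation. */
static inline void mica_start_op(struct mica_ctx_user_regs *regs,
                                 uint64_t src_desc, uint64_t dest_desc,
                                 uint64_t extra_data_va,
                                 uint64_t opcode_and_len)
{
    assert(regs != NULL);
    regs->src_data       = src_desc;
    regs->dest_data      = dest_desc;
    regs->extra_data_ptr = extra_data_va;
    regs->opcode         = opcode_and_len;  /* must be the final write */
}
```

The volatile qualifier only keeps the compiler from reordering or merging the stores; whether additional ordering guarantees are needed for MMIO on a given platform is outside the scope of this sketch.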
Normal completion is reported by a Completion Interrupt. When the destination data size is known beforehand, the User Process does not need to read any status from the Context in response to a Completion Interrupt. When the destination data size is not known, the User Process reads the MICA_COMP_CTX_USER_CONTEXT_STATUS register to get that information. After it sends the Completion Interrupt, the Context is ready to accept the next operation. Note that completion can also be determined by polling, if desired.

TLB Misses are normally bound to System-level software and are transparent to the User level.

11.3.3.2 TLB Miss

During normal processing operations, the Engine looks up VA-to-PA translations in the associated Context’s partition of the TLB (note that the TLB is partitioned such that each Context has its own set of entries). If a translation is not found, the Engine sets the Context’s TLB Miss Interrupt Pending status bit. Normally the TLB Miss interrupt will not be masked and therefore an interrupt will be sent to the Bound Tile. Alternatively, software can mask the interrupt and poll the TLB Miss Interrupt Pending bit (TLB_MISS_PENDING). In either case the software follows the steps described below in response to a TLB Miss.
1. Read the Context’s MICA_COMP_CTX_SYS_MISS_VA register, to find the VA that was not found in the TLB, and also a suggested TLB Index to load with the new translation.
2. Access the page table to find the appropriate PA and Attributes. This task is software-dependent, and not in the scope of this document.
3. Write the Context’s TLB Address register for the entry being replaced. This write clears the Valid bit for the entry (the entry spans two registers, so this action makes the update free of races).
4. Write the Context’s TLB Attributes Register for the entry being replaced.
Writing this register with the TLB Miss Ack bit (TLB_MISS_ACK) = 1 also clears the TLB Miss Pending Interrupt bit (TLB_MISS_PENDING), which implicitly dismisses the TLB Miss interrupt. Note that translations are done ahead of time, in parallel with data processing, so in most cases the operation will not be stalled waiting on the translation. If the operation did become stalled, dismissing the interrupt will unstall it. The Engine does not swap to another Context operation when stalled, so TLB misses should be dismissed quickly to minimize lost Engine throughput.
5. After the TLB is filled, the Engine will repeat the lookup that missed.

If for some reason it is not possible for software to complete the TLB fill, the operation must be terminated as described in “Terminate Operation for a Specific Context” on page 188.

11.3.3.3 Deferred Interrupts

When a User Process is swapped out of a Tile, it will not be available to receive interrupts. Note that in this case the User Process still wants the operation to proceed (so the operation is not terminated or stalled); only the delivery of the Completion Interrupt is deferred. In some instances, System software might need to defer TLB Miss Interrupts. The MICA_COMP_CTX_SYS_INT_MASK register provides this capability. The Mask Register has three alias addresses: Interrupt Mask (MICA_COMP_CTX_SYS_INT_MASK), Interrupt Mask Set (MICA_COMP_CTX_SYS_INT_MASK_SET), and Interrupt Mask Reset (MICA_COMP_CTX_SYS_INT_MASK_RESET). Using the set and reset addresses eliminates the need to do a read-modify-write to the register. The hardware keeps masked interrupts in a pending state and delivers them when the mask is cleared, as described in Section 11.3.2 Tile Interrupts.

Note that if the operation completes while the Completion Interrupt is masked, the Engine is freed up for a new operation.
However, if a TLB Miss occurs while the TLB Miss Interrupt is masked, the Engine still waits for the TLB Miss to be serviced (that is, the Engine does not swap to another Context operation). In general, software should be aware that deferring TLB Miss service can have a negative impact on Engine throughput.

11.3.3.4 Pause Context

The following sequence can be used by software to pause a Context for a period of time. Normally System software will do this when it wants to remap memory. Once the Context is paused, no memory accesses or TLB lookups will be initiated for it.
1. Write a 1 to the PAUSE bit in the Context’s Control register (MICA_COMP_CTX_SYS_CONTROL).
2. If the Context has not been programmed for an operation, or was programmed but not yet assigned to an Engine, it will go to a Paused state immediately.
3. Poll the Context’s Status Register state field until the Context state is Paused Idle or Paused Run Wait. If the Context was already assigned to an Engine it will complete the operation and then go to Paused Idle. Refer to Figure 11-3.
4. The Status register (MICA_COMP_CTX_USER_CONTEXT_STATUS) also indicates if the Context has any pending interrupts. If it does and the interrupt was unmasked, the software must wait for the interrupt to be delivered.
5. Write a 0 to the PAUSE bit of the MICA_COMP_CTX_SYS_CONTROL register to allow the operation to continue.

The sequence described above can be done regardless of the state of the operation; that is, the PAUSE bit can be written whether or not an operation has been requested. If one was requested, the PAUSE bit can be written whether the operation is queued waiting for an Engine or already assigned to an Engine. This allows the process that requests operations and the process that manages remapping memory to operate independently.
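The pause behavior described above, together with the state legend of Figure 11-3, can be modelled as a small software state machine. The sketch below is only a reading aid, not a description of the hardware implementation: the type and function names are hypothetical, reset handling is omitted, and two transitions are inferred rather than stated (an operation requested while in PAUSE IDLE moving to PAUSE RW, and clearing PAUSE returning PAUSE RW to RUN WAIT and PAUSE IDLE to IDLE).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ctx_state { CTX_IDLE, CTX_RUN_WAIT, CTX_RUN, CTX_PAUSE_IDLE, CTX_PAUSE_RW };

enum ctx_event {
    EV_OPCODE_WRITTEN,   /* Context User OPCODE register written */
    EV_ENGINE_ASSIGNED,  /* Scheduler assigns an Engine          */
    EV_OP_COMPLETE,      /* Engine finishes the operation        */
    EV_PAUSE_SET,        /* PAUSE bit written to 1               */
    EV_PAUSE_CLEAR       /* PAUSE bit written to 0               */
};

struct ctx_model {
    enum ctx_state state;
    bool pause;          /* current value of the PAUSE bit */
};

static inline void ctx_step(struct ctx_model *c, enum ctx_event ev)
{
    assert(c != NULL);
    switch (ev) {
    case EV_OPCODE_WRITTEN:
        if (c->state == CTX_IDLE)
            c->state = CTX_RUN_WAIT;
        else if (c->state == CTX_PAUSE_IDLE)
            c->state = CTX_PAUSE_RW;      /* inferred from the legend */
        break;
    case EV_ENGINE_ASSIGNED:
        if (c->state == CTX_RUN_WAIT)
            c->state = CTX_RUN;
        break;
    case EV_OP_COMPLETE:
        if (c->state == CTX_RUN)          /* paused while running: finish, */
            c->state = c->pause ? CTX_PAUSE_IDLE : CTX_IDLE; /* then pause */
        break;
    case EV_PAUSE_SET:
        c->pause = true;
        if (c->state == CTX_IDLE)
            c->state = CTX_PAUSE_IDLE;
        else if (c->state == CTX_RUN_WAIT)
            c->state = CTX_PAUSE_RW;
        /* In CTX_RUN the operation completes first (see EV_OP_COMPLETE). */
        break;
    case EV_PAUSE_CLEAR:
        c->pause = false;
        if (c->state == CTX_PAUSE_IDLE)
            c->state = CTX_IDLE;          /* inferred */
        else if (c->state == CTX_PAUSE_RW)
            c->state = CTX_RUN_WAIT;      /* inferred */
        break;
    }
}

/* True once polling (step 3 of the sequence) would stop. */
static inline bool ctx_is_paused(const struct ctx_model *c)
{
    return c->state == CTX_PAUSE_IDLE || c->state == CTX_PAUSE_RW;
}
```

The model makes one property of the sequence easy to see: setting PAUSE while the Context is in RUN does not stop the operation, so software must keep polling until the state field actually reports one of the two Paused states.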
11.3.3.5 TLB Probe

The TLB can be probed by software to check for the presence of a given translation.
1. Write the VA to be checked to the Context’s MICA_COMP_CTX_SYS_PROBE_VA register.
2. Read the hit/miss status and index in the Context’s MICA_COMP_CTX_SYS_PROBE_STATUS register.

This operation can be done regardless of the Context’s state.

11.3.3.6 TLB Shootdown

When operating system software needs to move or remove pages of Physical Memory it must coordinate with agents (including Tile and non-Tile agents) that have copies of translations in their TLBs. The following sequence is used to coordinate removing entries from the TLB and getting an acknowledgment that the operation has completed.
1. Pause the Context as described in “Pause Context” above. This is necessary to ensure that all in-flight memory operations that might be using the affected pages are completed.
2. Remove the affected entry or entries. For each page:
a. Write the VA to the Context’s Probe register.
b. Read the hit/miss status and index in the Context’s MICA_COMP_CTX_SYS_PROBE_STATUS register (PROBE_STATUS).
c. If hit, write the Context’s TLB Attribute register (TLB_ENTRY_ATTR) of the index that hit, writing the Valid bit (VLD) to 0.
d. An alternative to probing is to invalidate all of the Context’s TLB entries by writing each entry’s valid bit to 0.
3. Un-pause the Context by writing a 0 to the PAUSE bit of the MICA_COMP_CTX_SYS_CONTROL register to allow the operation to continue.

11.3.3.7 Terminate Operation for a Specific Context

When a process running on a Tile is terminated prior to completion, for example due to some error, any I/O operation associated with it should also be terminated. The following sequence is used to terminate the operation for a Context owned by that process:
1.
Write a 1 to the RESET bit of the Context’s MICA_COMP_CTX_SYS_CONTROL register.
2. Poll the state field of the Context’s MICA_COMP_CTX_USER_CONTEXT_STATUS register until the Context state is IDLE.
a. If the Context has not been assigned to an Engine before the RESET bit of the MICA_COMP_CTX_SYS_CONTROL register is written, it will go to IDLE state immediately and not be assigned to an Engine.
b. If the Context has been assigned to an Engine prior to RESET being written, it will terminate any operation as quickly as possible, but might need to wait for in-flight memory operations to complete before going to IDLE state.

CHAPTER 12 CRYPTOGRAPHIC ACCELERATOR INTERFACE

This chapter describes the TILE-Gx Crypto implementation of the Multicore iMesh Coprocessing Accelerator (MiCA™).

12.1 Engines

There are five engines in Crypto MiCA:
• One Memory-to-Memory Copy Engine, described in “Memory-to-Memory Copy Engine” on page 191
• Two Tilera Crypto Packet Processors, described in “Crypto Packet Processor” on page 192
• One Tilera KASUMI core and Tilera SNOW-3G core, described in “KASUMI and SNOW-3G Engine” on page 193
• One Tilera Public Key Accelerator (PKA), described in “Public Key Accelerator Engine” on page 195

12.2 Schedulers

The Memory-to-Memory Copy, Crypto Packet Processor, and KASUMI/SNOW-3G Engines are each scheduled by their own Scheduler. All the schedulers have four priority levels. Each scheduler uses fixed priority across levels and a round-robin policy within each level, with programmable timers that guarantee that lower levels do not get “starved”.

12.3 Contexts

The Crypto MiCA supports forty Contexts.
12.4 Engine-Specific Details

The next sections provide unique details of each of the Engines.

12.4.1 Memory-to-Memory Copy Engine

Memory-to-Memory copy operations are specified by Engine Types 0 and 1 in the MICA_CRYPTO_CTX_USER_OPCODE register. Type 0 does a copy of source data to destination; Type 1 does a copy of inverted source data to destination. No extra data is used for memory-to-memory copy. The Engine number for memory-to-memory copy is 0.

12.4.1.1 Usage Constraints for the Engine

This section describes both guidelines and constraints for using the Engine.

None.

12.4.2 Crypto Packet Processor

Crypto Packet Processor operations are specified by Engine Type 2 in the MICA_CRYPTO_CTX_USER_OPCODE register. Two copies of the engine (Engine Number 2 and Engine Number 3) are included to achieve higher total throughput. The scheduler selects which engine is assigned, and the selection is transparent to the Contexts requesting the operation.

Note: Any particular Internet Protocol Security (IPSec) flow needs to go through one particular Engine.

The Crypto Packet Processor performs operations on a packet through the use of a token and a Context Record. Extra Data is used to create the Input Token and Context Record used by the Crypto Packet Processor, and to receive the Result Token created by the Crypto Packet Processor.
• An Input Token is a series of commands informing the Crypto Packet Processor Engine how to process the packet; for example, what headers to insert or remove, what encryption algorithms to use, etc. The size of the Input Token, in number of 4-byte words, is specified in the OPCODE register.
• The Context Record supplies information used by the Crypto Packet Processor, such as encryption keys, IVs, etc.
• The Result Token supplies status about the operation, including errors detected by the Crypto Packet Processor.

Note: The detailed description of the Input Token, Context Record, and Result Token is provided in “Packet Processor — Programming” on page 264.

The overall flow for using the Crypto Packet Processor is described below. This list highlights only the parts of the flow that are unique relative to the steps described in Chapter 11: Common Accelerator Interface (MiCA), Section 11.3.1 General Flow.
1. In the Extra Data step of the General Flow, Tile software puts the Input Token and Context Record in memory, and also allocates space (32 bytes) for the Result Token.
2. If the operation prepends and/or appends data onto the destination packet, Tile software allocates space and sets the Destination Size field (SIZE) of the MICA_CRYPTO_CTX_USER_OPCODE register to the appropriate value.
3. The Crypto Packet Processor reads the Input Token, Context Record, and source data, and then performs the operation.
4. After writing the destination data, the Crypto Packet Processor Engine writes the updated Context Record and the Result Token to the Extra Data area.

12.4.2.1 Usage Constraints for the Crypto Packet Processor Engine
This section describes both guidelines and constraints for using the Crypto Packet Processor Engine.
• Configuration/initialization for an engine is done through MMIO access to the engine registers. See “Inline Packet Engine” on page 371 for general information, Appendix D for the register map, and Section 11.2.1.2 Overview and Major Functional Blocks for how to map and access the registers for a particular engine. Only configuration registers (D.9.1 Configuration Registers on page 463) should be accessed, and only before any packets are processed. The processor comes up in a usable state, so it might not be necessary to make any changes to the configuration.
• Many of the runtime registers are accessed by the MiCA and should not be written.
• Do not confuse the Context Record with a MiCA Context; they are not related.
• Tokens for various operations are provided by Tilera. The user must fill in certain fields in the token on a per-packet basis before sending it to the Crypto Packet Processor Engine. See D.8 Context Record Definition on page 433 and “Inline Packet Engine — Token Examples” on page 491 for details.
• MiCA takes care of DMA of packets.
• See “Result Token Definition” on page 429 for a description of the Result Token.

12.4.3 KASUMI and SNOW-3G Engine
The KASUMI and SNOW-3G cores share a channel, so only one can be processing an operation at a time. The scheduler assigns Contexts that are requesting operations from either of these cores in a round-robin manner, which is transparent to the Contexts requesting the operation.

KASUMI operations are specified by Engine Type 4 in the MICA_CRYPTO_CTX_USER_OPCODE register, and SNOW-3G operations are specified by Engine Type 5. The Engine number for KASUMI/SNOW-3G is 1.

Both of these cores use the OPCODE register and Extra Data as listed below. The Engine Type field determines which core is used and, therefore, how the Extra Data is used.

12.4.3.1 KASUMI Engine
Note: The Extra Data fields used by KASUMI are described in detail in “KASUMI Engines” on page 261.

Table 12-1. OPCODE Register and Extra Data Description

Word | Bits | Name | Description
(OPCODE register) | 56:53 | OPCODE | 0011 = KASUMI Encrypt; 0010 = KASUMI Decrypt; 0101 = f8; 1001 = f9. All other values are illegal.
Extra Data – 4 Words
Word 0 | 31:0 | config | This field is only used for f8 mode and f9 mode. Bit [0], direction: specifies the direction of the f8 or f9 session (uplink or downlink). Bits [5:1], bearer: specifies the Bearer value for f8 sessions. Bits [15:6]: Reserved. Bits [31:16], length: specifies the length of the message data in bits during f9 mode. The total message length can be up to 2^16 = 65536 bits.
Word 0 | 63:32 | Reserved |
Word 1 | 31:0 | count | This field is only used during f8 mode and f9 mode. This is the 32-bit count value.
Word 1 | 63:32 | fresh data | This field is only used during f9 mode. This is the 32-bit fresh value.
Word 2 | 63:0 | key | Bits [63:0] of the 128-bit key.
Word 3 | 127:64 | key | Bits [127:64] of the 128-bit key.

Note: For all operations except f9, the destination data size is the same as the source data size. For f9 the destination data size is 4 bytes.

12.4.3.2 SNOW-3G Engine
Note: The Extra Data fields used by SNOW-3G are described in detail in “SNOW3G Engines” on page 255.

Table 12-2. OPCODE Register and Extra Data Description

Word | Bits | Name | Description
(OPCODE register) | 54:53 | OPCODE | 00 = UEA2/128-EEA1 decrypt; 01 = UEA2/128-EEA1 encrypt; 10 = UIA2/128-EIA1 decrypt; 11 = UIA2/128-EIA1 encrypt.
Extra Data – 5 Words
Word 0 | 63:0 | iv | Initialization Vector. This field contains bits 63:0 of the 128-bit IV.
Word 1 | 127:64 | iv | Initialization Vector. This field contains bits 127:64 of the 128-bit IV.
Word 2 | 63:0 | key | SNOW Key. This field contains bits 63:0 of the 128-bit key for SNOW operations.
Word 3 | 127:64 | key | SNOW Key. This field contains bits 127:64 of the 128-bit key for SNOW operations.
Word 4 | 15:0 | length | Length Vector. This field is the number of bits for each new authentication operation. The maximum value for this input is 2^16-64 (which is 65472). Note: The maximum length value defined by [SNOW3G] is 20000 bits.
Word 4 | 63:16 | Reserved |

For both IV words (Word 0 and Word 1), the IV is constructed as follows, where 0^n denotes n zero bits:
• For UEA2 and 128-EEA1: {COUNT-C | BEARER | DIRECTION | 0^26}
• For UIA2: {COUNT-I | FRESH}
• For 128-EIA1: {COUNT-I | BEARER | 0^27}

Note: For all operations the destination data size is the same as the source data size.

12.4.3.3 Usage Constraints for the Engine
This section describes both guidelines and constraints for using the Engine.
• As it applies to Extra Data (ED) alignment, in the MICA_CRYPTO_CTX_USER_OPCODE register, ED size is the number of 8-byte words.
• If it is an odd number of 4-byte words, the CR is padded with zeros at the end to fill out the remainder of the ED size.

12.4.4 Public Key Accelerator Engine
The Public Key Accelerator (PKA) is not accessed via MiCA Context Registers; instead it is accessed via the Crypto Global MMIO space. The reason is that the attributes of the operations done by the PKA differ from those of the other engines. In general, small operands (for example, ~0.5 kB to 2.5 kB) are operated on for a long time (for example, ~100k to 250k cycles). The time spent moving the operands and results is small relative to the computation time, so the latency benefit of DMA over programmed I/O is minimal.

The Public Key Accelerator control and status registers, as documented in “Public Key Accelerator Engine” on page 195, occupy Engine Numbers 0x10 through 0x13. This allows effectively 20 bits of offset: 18 bits from the normal register offset (which is [17:0]), plus two bits from what is normally the Engine Number ([19:18]). At Engine Number 0x14, Offsets 0-64kB address the Tile-to-PKA Window RAM.
Offsets 64kB and above address the Tilera-specific CSRs.

“Host memory” as used here and in Tilera Public Key Accelerator documentation refers to the Tile-to-PKA Window RAM, as shown in Figure 12-1. This RAM appears to Tiles as a block of MMIO registers, and to the Public Key Accelerator as 64kB at addresses 0 to 0xFFFF.

The Public Key Accelerator also contains a high-performance True Random-Number Generator (TRNG).

The PKA command interface uses descriptor/result rings held in Host memory space. Descriptors do not contain any vector data; they contain pointers to vectors in Host memory space.

[Figure 12-1: Public Key Accelerator in Crypto MiCA. Block diagram showing the Mesh Interface (QDN, RDN, and SDN network interfaces), MMIO Registers and Context State, Engine Scheduler, Egress DMA (from memory to MiCA), Ingress DMA (from MiCA to memory), IPI Generator, Tile-to-PKA Window RAM, and the Function-Specific Engines including the Public Key Accelerator.]

12.4.4.1 Descriptor Ring Management
Descriptors are 32 bytes in size. Up to four separate command descriptor rings can be used, each accompanied by a result descriptor ring of the same size. Command and result descriptor rings can be co-located or placed at different (non-overlapping) locations in Host space.
The number of descriptors on each ring is configurable from 1 to 64k in the Public Key Accelerator RING_SIZE_n registers (but because the rings must be allocated in the PKA Window RAM, the actual maximum size is limited).

If multiple rings are used, rotating priority is normally used to select which ring supplies the next PKA command to execute. It is possible to place Ring 0 at a higher priority than the remaining rings. It is also possible to turn the rotating priority off (in which case Ring 0 gets the lowest priority and Ring 3 the highest priority).

We recommend using separate rings if large differences in execution times for commands are expected. This prevents results for commands with short execution times from being stalled behind one result for a command with a long execution time. The reason is that most of the internal buffer RAM is used to buffer command descriptors from each command ring, and no new commands can be loaded when the oldest command in this buffer has not yet completed.

Read/write pointers for the rings should be kept locally by the Host and the PKA master controller (the latter uses some words of buffer RAM to hold them, providing progress indication and a re-sync capability). No true ‘ownership’ bits are used in the descriptors. These are not necessary because the command counters can be used to determine whether new commands can be written. Result descriptors contain two ‘written zero’ bits that can be used (by a driver) for ownership indications, but are mainly intended to prevent result interrupt race problems.

12.4.4.2 Command Descriptor Contents
Command descriptors are 32 bytes (8 words of 32 bits) long.
The Host must indicate the presence of new command descriptors in a ring by incrementing the command counter associated with that ring (after the descriptor contents have been written). It is possible to link command descriptors so that one command can only be executed when the previous linked command has been executed. Linked descriptors must be transferred into the ring as a whole (that is, the command counter must be incremented by the number of linked commands). The descriptor contains pointers to operands in Host memory, a command field indicating which command to execute, and some other miscellaneous information.

12.4.4.3 Result Descriptor Contents
After finishing a command, the PKA master controller converts the original command descriptor into a result descriptor and writes this descriptor to the result ring. The result descriptor contains status information about the operation.

12.4.4.4 Interrupts
The Public Key Accelerator generates the following interrupts.
• Command queue empty, four interrupts, one per queue. The threshold is configurable in the Public Key Accelerator MICA_CRYPTO_ENG_IRQ_THRESH_n register. The interrupt should only be dismissed (by writing to the Public Key Accelerator MICA_CRYPTO_ENG_AIC_ACK register) when new commands are added to the queue.
• Result queue full, four interrupts, one per queue. The threshold is configurable in the Public Key Accelerator MICA_CRYPTO_ENG_IRQ_THRESH_n register. A timeout value for interrupting on a non-empty result queue is also configurable in the same register. The interrupt is dismissed by writing to the Public Key Accelerator MICA_CRYPTO_ENG_RESULT_COUNT_n register (to indicate that results have been processed) and then writing the Public Key Accelerator MICA_CRYPTO_ENG_AIC_ACK register (to acknowledge the interrupt). The writes must be done in this order.
• PKA Master interrupt. Interrupt generated directly by the PKA master controller to request attention from the Host (intended for signaling errors and/or completed commands).
• TRNG interrupt. Interrupt generated directly by the PKA master controller to request attention from the Host (intended for signaling errors and/or completed commands).

Each interrupt has an associated Tile Interrupt Binding MICA_CRYPTO_ENG_INT_BINDING_PKA_QUEUE_n_EMPTY register. When the Public Key Accelerator generates an interrupt, it is marked as pending in the IPI Generator logic, which then arbitrates to send an IPI to the bound tile. When the IPI is sent, the interrupt is marked as not pending. As a result, multiple Public Key Accelerator interrupts might coalesce into a single IPI, depending on how quickly they occur.

Interrupt Latency
To keep the Public Key Accelerator fully loaded, the driver must keep its input queue(s) full enough that a new command is ready whenever a farm engine is available. A rough example of interrupt service latency is provided below.

Assumptions:
1. Crypto frequency – 800 MHz.
2. Core frequency – 800 MHz. Note that higher core frequency relative to Crypto frequency will give a longer latency period, since a given PKA operation is measured in cycles of Crypto frequency.
3. A minimum PKA operation – 80k cycles.
4. Only one Command / Result queue configured.
5. Command / Result queue size – 20 entries. This uses 1280 Bytes in the PKA Window RAM if the Command and Result queues do not overlap, and 640 Bytes if they do. There is enough space for maximum size operands (4 kbit) to be statically allocated.
6. Result queue threshold – 10 entries. This is chosen to moderate the number of interrupts to one every 10 operations.
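The arithmetic implied by these assumptions can be checked with a short calculation. This is a sketch only: the function names are hypothetical, and the conversion to wall-clock time simply divides cycles by the assumed 800 MHz Crypto frequency.

```python
DESCRIPTOR_BYTES = 32  # descriptor size stated in Section 12.4.4.1


def ring_bytes(entries, overlapping):
    """Window RAM used by one command ring plus its result ring."""
    rings = 1 if overlapping else 2   # co-located rings share the same space
    return entries * DESCRIPTOR_BYTES * rings


def refill_budget_cycles(queue_entries, result_threshold, min_op_cycles):
    """Crypto cycles available to refill the command queue after the interrupt.

    When `result_threshold` results have been produced, the remaining
    commands are still queued, each taking at least `min_op_cycles`.
    """
    remaining = queue_entries - result_threshold
    return remaining * min_op_cycles


def cycles_to_seconds(cycles, frequency_hz):
    return cycles / frequency_hz


budget = refill_budget_cycles(queue_entries=20, result_threshold=10,
                              min_op_cycles=80_000)
print(ring_bytes(20, overlapping=False), ring_bytes(20, overlapping=True))
print(budget, cycles_to_seconds(budget, 800e6))
```

Under these assumptions the driver has at least 800k Crypto cycles, about 1 ms at 800 MHz, to service the interrupt and refill the command queue.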
With these specifications, there will be 10 entries on the command queue when the interrupt is triggered. The command queue must be refilled before those operations are completed, which takes a minimum of 800k cycles. That time can be increased either by having a smaller threshold (more interrupts), or by having more queue entries, which would require dynamically allocating space for operands.

CHAPTER 13 COMPRESSION ACCELERATOR INTERFACE

13.1 Overview
The software interface uses the MiCA™ standard interface as described in Chapter 11: Common Accelerator Interface (MiCA). This chapter describes those features unique to the compression functionality.

The TILE-Gx™ architecture supports hardware acceleration for lossless compression and decompression operations. The Raw DEFLATE format (RFC1951) and GZIP file format (RFC1952) are supported by the hardware accelerators. The TILE-Gx36 implementation supports full-duplex, 10 Gb/s compression and 10 Gb/s decompression.

The TILE-Gx36 implementation supports a common Multicore iMesh Coprocessing Accelerator (MiCA) front-end architecture. The TILE-Gx36 implementation provides two TILE-Gx ZIP controllers. Each controller has one compression engine and two decompression engines. Each controller supports forty Contexts.

The following terms are used in this chapter:
• Transaction. The operation associated with one opcode MMIO write.
• Context. Hardware entity that contains information (interrupt bindings, state, source and destination data pointers, opcodes, and other engine-specific information) for processing data.
• DEFLATE. A lossless data compression algorithm that uses a combination of the LZ77 algorithm and Huffman coding.

13.2 Data Flows

13.2.1 Typical Compression Flow
The API configures one of the Contexts via MMIO packets over the QDN network.
The API specifies the opcode, transfer size, source data descriptor, and the destination data descriptor. Once a Context is programmed, it generates a pending compression transaction to be processed.

The GZIP controller selects the next compression transaction if there is an available engine to process the transaction. New Contexts are scheduled on a round-robin basis. The selected Context is considered “active”.

The eDMA engine first performs a TLB lookup, and then decodes the source data descriptor. The engine then fetches the uncompressed source data from memory over the SDN network, cache line by cache line. If the source data is not aligned with the cache line boundary, the eDMA engine discards the unused bytes from that line.

The source data might not be returned in the order that it was requested. The eDMA engine assembles the returned source data from the RDN network. The eDMA engine uses its own network buffer to smooth out the latency jitter from the memory space. If the data comes from the cache, the latency tends to be lower than if it comes from external DRAM.

The compression engine performs the data compression. The raw DEFLATE format (RFC1951) and GZIP format (RFC1952) are supported.

The iDMA engine assembles the compressed data generated by the compression engine. Then it sends the compressed data, cache line by cache line, back to memory over the SDN network. The iDMA engine uses its own network buffer to smooth out the latency jitter to the memory space. A masked write is used for transfers smaller than one cache line; pad writes are not supported.

After the iDMA engine sends the last piece of the compressed data to the memory space, it waits for an acknowledgement that all writes to memory are visible.
Then a completion interrupt is dispatched over the RDN network. The API then checks the completion status via an MMIO packet. This packet includes an indication of successful completion or an error status. It also indicates the number of bytes of compressed data.

A new transaction (from a different Context) can be scheduled to the same compression engine once the last flit of the data from the previous transaction is on the way to the mesh network.

13.2.2 Typical Decompression Flow
The decompression flow is very similar to the compression flow: the API configures a Context, the eDMA engine fetches the source data, the decompression engine performs the decompression, and the iDMA engine sends the decompressed data back to memory. A completion interrupt is dispatched after the writes are visible in the memory space.

The iDMA engine for decompression handles a higher output data rate than the one for compression, as the uncompressed data is typically larger than the compressed data.

13.3 Compression Engine

13.3.1 Engine Configuration
The compression engine can be customized in four different input patterns. Typically, achieving a better compression ratio requires more processing time, and the pattern you choose determines that trade-off. Users must weigh these considerations when deciding how to configure the engine for a given application.

Users can specify the following parameters:
• NICE_MATCH
• GOOD_MATCH
• MAX_CHAIN_LENGTH
• MAX_DIST
• MATCH0_RATIO
• MATCH1_RATIO
• DTREE
• BINARY_HINT
• SMALL_PACKET_HINT
• HIGH_COMPRESSION_HINT

Refer to the compression engine register definitions in “Compression/Decompression Engine Registers” on page 202 for more details.

13.3.2 GZIP Handling
The hardware engine can be configured to generate compressed data in GZIP format.
The configuration can be set using the opcode flag or the MICA_COMP_ENG_DEFL_REG_DEF_CTL register. The GZIP header is always generated as follows:

Table 13-1. GZIP Header Format

Structure | Value
ID1 | 0x1F
ID2 | 0x8B
CM | 0x8
FLG | 0x0
MTIME | 0x0
XFL | 0x4
OS | 0x3 (UNIX)

The GZIP trailer contains two fields: a CRC32 (ISO 3309) for integrity checking and the input size of the original (uncompressed) data.

13.4 Decompression Engine

13.4.1 GZIP Handling
The hardware engine automatically detects the format of the compressed data, either raw DEFLATE format or GZIP format.

If the GZIP format is detected, the engine parses the header fields, including all optional fields, such as a header CRC16, extra fields, the file name, and comments. For header integrity, the extracted header fields are checked against the optional CRC16. The context status is updated accordingly in case of an error.

The engine does not forward the extracted header fields to the user space, as the header fields are in readable format and can be parsed by the user application.

For payload integrity, the uncompressed data is checked against the CRC32. The size of the uncompressed data is checked against the input size (ISIZE) field that is part of the GZIP trailer.

13.5 Memory-to-Memory Copy
Memory-to-Memory copy operations are specified by Engine Types 0 and 1 in the MICA_COMP_CTX_USER_OPCODE register. Type 0 copies the source data to the destination; Type 1 copies the inverted source data to the destination. No Extra Data is used for memory-to-memory copies.

Both the compression engine and the decompression engines are capable of supporting memory-to-memory copy operations. At run time, either the compression engine or the decompression engine (but not both) should be configured to support memory-to-memory copy operations.

13.6 API
The compression controller supports a common MiCA front-end architecture.
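The GZIP framing described in Sections 13.3.2 and 13.4.1 can be reproduced in software to cross-check hardware output. The sketch below builds the fixed 10-byte header from Table 13-1 and the 8-byte trailer (CRC32 followed by ISIZE, both little-endian as defined by RFC 1952). The function names are hypothetical, and Python's `zlib` stands in for the hardware DEFLATE engine.

```python
import gzip
import struct
import zlib


def gzip_header():
    """Fixed 10-byte header: ID1, ID2, CM, FLG, MTIME (4 bytes), XFL, OS."""
    return struct.pack("<BBBBIBB", 0x1F, 0x8B, 0x08, 0x00, 0, 0x04, 0x03)


def gzip_trailer(uncompressed):
    """CRC32 of the uncompressed data, then its size modulo 2^32 (ISIZE)."""
    crc = zlib.crc32(uncompressed) & 0xFFFFFFFF
    isize = len(uncompressed) & 0xFFFFFFFF
    return struct.pack("<II", crc, isize)


def software_gzip(uncompressed):
    """Raw DEFLATE payload (software stand-in) wrapped in the GZIP framing."""
    deflater = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15 = raw DEFLATE
    raw = deflater.compress(uncompressed) + deflater.flush()
    return gzip_header() + raw + gzip_trailer(uncompressed)


payload = b"TILE-Gx compression example " * 8
framed = software_gzip(payload)
print(len(gzip_header()), gzip.decompress(framed) == payload)
```

A standard GZIP reader accepts the result, which makes this a convenient reference when validating the CRC32 and ISIZE checks performed by the decompression engine.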
13.6.1 Context Registers
Each GZIP controller supports 40 Contexts. Each Context has two distinct sets of registers, which allows for assigning different protection levels, typically for User and System space.

The User space registers typically include the OPCODE register, source data descriptor register (MICA_COMP_CTX_USER_SRC_DATA), destination descriptor register (MICA_COMP_CTX_USER_CONTEXT_STATUS), and operation length register (MICA_COMP_CTX_USER_OPCODE).

The System space registers typically include the interrupt binding registers (MICA_COMP_CTX_SYS_COMP_INT), TLB handling registers (MICA_COMP_CTX_SYS_TLB_MISS_INT, for example), and system status registers (MICA_COMP_CTX_USER_CONTEXT_STATUS), etc. Refer to Chapter 11: Common Accelerator Interface (MiCA) for more details.

13.6.2 Compression/Decompression Engine Registers
Each compression engine and decompression engine has its own set of registers, including the configuration registers and performance counters.

OPCODE
The GZIP controller supports the following ENGINE_TYPEs in the MICA_COMP_CTX_USER_OPCODE register.

Table 13-2. ENGINE_TYPE Register Description

ENGINE_TYPE | Description
000 | Memory copy
001 | Memory copy with data inversion
010 | Compression
011 | Decompression
1xx | Reserved

Each MICA_COMP_CTX_USER_OPCODE register has an engine-specific field to specify a flag associated with the ENGINE_TYPE. The compression engine supports the flags listed in Table 13-3. Refer to the Deflate Engine Configure Control register (MICA_COMP_ENG_DEFL_DEF_CTL) for more details on each field.

Table 13-3. Compression Engine Flags

Bits | Name
9 | CHINT
8 | SHINT
7 | BHINT
6 | DTREE
5:2 | MAX_CHAIN_LENGTH
1 | FORMAT
0 | CONFIG

No “Flag” bits are defined for the decompression engine.

13.6.3 Status Registers
Once a compression transaction or decompression transaction is completed, the per-context MICA_COMP_CTX_USER_CONTEXT_STATUS register provides the status of the operation: the number of bytes of the last transaction result, and the exception status, if any.

Decompression Exception Status

Table 13-4. Decompression Exception Status Register Description

Exception | Description
STS_BTYPE_ERR | BTYPE has an unsupported error type.
STS_NLEN_ERR | The number of data bytes in a non-compressed block is corrupted.
STS_SYMBOL_ERR | An unrecognized Huffman symbol is detected.
STS_GZ_CRC16_ERR | The GZIP header contains incorrect CRC16.
STS_GZ_CRC32_ERR | The GZIP payload contains incorrect CRC32.
STS_GZ_ISIZE_ERR | The number of uncompressed GZIP data bytes does not match the original raw (uncompressed) data bytes.

13.6.4 Transaction Size
The following items are characteristics of the compression/decompression engines:
• Different size. The compression engine and decompression engine typically produce data output that has a different size than the data input. The output size is usually smaller than the input size for compression, and usually larger than the input size for decompression.
• Large size. The maximum size of the compressed or uncompressed buffer is 64MB in each transaction. The transactions are stateless; that is, a flush is performed between transactions.
• Small size. Although the compression engine supports small packet sizes, small packets will result in non-optimal performance. A threshold can be set so that smaller packets will not be compressed and will be copied verbatim instead.
For example, RFC 2394 advises that implementations should not attempt to compress buffers smaller than 90 bytes.

13.6.5 Data Expansion Handling
In some circumstances, the output data from a compression operation will be larger than the input data. This condition might occur for a random data stream (for example, a previously compressed or encrypted file), or for extremely small packet sizes. In this case, do not use the expanded compressed data; copy the original data verbatim instead.

It is up to the user application to choose either the uncompressed data or the compressed data, based on the MICA_COMP_CTX_USER_CONTEXT_STATUS register. The source data buffer that holds the uncompressed data should be released only after the compression process is completed.

If the user application chooses to copy the data uncompressed, a 5-byte DEFLATE header must be prepended: one byte of 0x1, two bytes of LEN to indicate the number of bytes of uncompressed data, and two bytes containing the one’s complement of the LEN bitfield. Segmentation is necessary if the uncompressed data is greater than 64 K bytes, as the LEN bitfield has only two bytes.

13.6.6 Performance Counter
Various performance counting events can be selected. For details, refer to the MICA_COMP_ENG_DEFL_PERF_CTL_0 and MICA_DEFL_PERF_CTL_1 registers.

CHAPTER 14 FLEXIBLE I/O INTERFACE

14.1 Overview
The flexible I/O interface configures and formats data for up to 64 data pins. These data pins can be used to implement low-speed status and control bits, or to implement moderate speed asynchronous interface protocols such as HPI, ATA, etc.
Each I/O pin can be individually configured to be an input, output, or bidirectional pin with a number of drive and input options.

The interface supports simultaneous use by multiple processes with full protection and virtualization support. The virtualization support allows direct access to the interface by application-level programs with full process isolation and protection.

MMIO transfers are used to configure the interface and to supply and receive data from the I/O pins. An interrupt capability is supplied to allow interrupts to be generated on any transition of a pin. The high-level structure of the interface is shown in Figure 14-1.

[Figure 14-1: Flexible I/O Interface. Block diagram showing the MMIO Registers, Pin Format, and I/O Pads blocks connected by 64-bit RD DATA, WR DATA, OE, DOUT, DIN, and PINS paths, with CMD, PROT MSK, and INT signals.]

14.2 Virtualization and Protection Support
The flexible I/O interface supports virtualization and protection using two related mechanisms. The register set of the interface is duplicated at eight different addresses, each set representing a service domain. Associated with each service domain is a programmable privilege level and a pin access mask.

An MMIO register access is valid only if the privilege level of the service domain that is being accessed is greater than or equal to the required privilege level of the register. If the register access is valid and is to a configuration register, then the access is performed. If the access is invalid, an error is logged and the action is inhibited. Invalid MMIO writes are ignored, and invalid MMIO reads return all zeros.

If the register access is valid and is to a pin manipulation register or interrupt vector register, then the access is further filtered by the service domain pin access mask. For an MMIO write, only the register bits whose corresponding bit is set in the pin access mask are updated. If the pin access mask is clear for a given bit, then the update of that bit is ignored. For MMIO reads, the data is read for bits that have the pin access mask set, and 0 is read for bits that have the pin access mask clear.

14.3 MMIO Register Map
The flexible I/O interface register map contains eight identical copies of the MMIO registers. Each copy corresponds to a different service domain as specified by the SVC_DOM field in the MMIO address. The service domain attributes are used to control register protection and filtering of the MMIO accesses.

The registers that provide access to data for multiple I/O pins are further masked by the service domain pin access mask. These registers are arranged into two groups: Pin access registers (GPIO_PIN_STATE through GPIO_PIN_INPUT_CND) and Interrupt control registers (GPIO_INT_VEC0_W1TC through GPIO_INT_VEC1_RTC). The full map is shown in the GPIO_HTML.

14.4 Interrupts
The flexible I/O interface supports interrupts on the rising or falling transition of each I/O pin. The interrupts are specified on a per-pin basis, and can be vectored to any tile event and interrupt vector number. The binding of the per-pin interrupts is specified by the GPIO_INT_BIND register.

Interrupts can be dismissed using either MMIO read or MMIO write accesses, as the application requires. The GPIO_INT_VEC0_W1TC and GPIO_INT_VEC1_W1TC registers allow reading the interrupt state while dismissing only a subset of the pending interrupts. Reading the GPIO_INT_VEC0_RTC or GPIO_INT_VEC1_RTC registers returns the currently pending interrupts and clears all of them. In this case the application needs to process all interrupts without leaving any pending.
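The pin access mask filtering of Section 14.2 and the two interrupt dismissal styles of Section 14.4 can be summarized with a small bit-level model. This is a sketch, not a driver: the function names are hypothetical, and the write-one-to-clear behavior of the W1TC registers is an assumption inferred from the register name.

```python
MASK64 = (1 << 64) - 1  # the interface handles up to 64 pins


def masked_write(current, value, pin_mask):
    """MMIO write: only bits set in the pin access mask are updated."""
    return ((current & ~pin_mask) | (value & pin_mask)) & MASK64


def masked_read(current, pin_mask):
    """MMIO read: bits clear in the pin access mask read as 0."""
    return current & pin_mask


def w1tc_write(pending, write_value):
    """Assumed write-one-to-clear: dismiss only the selected interrupts."""
    return pending & ~write_value & MASK64


def rtc_read(pending):
    """Read-to-clear: return all pending interrupts and clear them all."""
    return pending, 0


pins = masked_write(0b1010, 0b0101, pin_mask=0b0011)
print(bin(pins), bin(masked_read(0b1010, 0b0110)))
print(bin(w1tc_write(0b1110, 0b0100)), rtc_read(0b1110))
```

With the W1TC style, interrupts that were not written remain pending for later handling; with the RTC style, the value returned by the read is the only record of what was pending, so every returned bit must be handled.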
14.5 I/O Pin Driver Configuration

Each pin can be individually configured as an input, output, or bidirectional pin, with or without internal pullup/pulldown resistors. Output pins support either normal or open-drain drivers, with output drive strengths from 4 mA to 12 mA and optional slew-rate control. Input pins can be configured as normal inputs or Schmitt trigger inputs.

14.6 I/O Pin Clocking Control

All input pins are sampled, and all output pins and enables are applied, synchronously to an internally generated clock named GCLK. The source and frequency of GCLK are configured in the GPIO_GCLK_MODE register. The interface guarantees that two sequential register accesses are performed on different GCLK edges.

GCLK can be generated from either of two clock sources, CCLK divided by 2 or CORE_REF_CLK, and can be further divided by the DIVIDE parameter of the GPIO_GCLK_MODE register. Changing the GCLK period permits an application that is supplying data and a strobe to guarantee at least one GCLK period of setup and hold for the pin accesses of a data bus and a data strobe. For example, a write of a set of data bits followed by a write asserting the strobe guarantees that the data bits are stable for one GCLK cycle before the strobe occurs. It also guarantees that the next write to the data will not occur until one GCLK cycle after the assertion of the strobe.

On the input side, the application can read the pin state while waiting for the assertion of the strobe, and can then perform a separate read of the data bus if the external device does not supply sufficient setup time. This access pattern guarantees that the data bus is sampled at least one GCLK period after the appearance of the strobe.
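The setup guarantee for a data bus and strobe can be illustrated with a toy timeline in which each register write takes effect on its own GCLK edge. This is a sketch under that single assumption. The function names and the `data` and `strobe` pin groups are hypothetical, and no real register layout is modeled.

```python
def apply_writes_on_gclk_edges(initial_pins, writes):
    """Apply one write per GCLK edge and return the pin state after each edge."""
    pins = dict(initial_pins)
    timeline = []
    for update in writes:
        pins.update(update)           # sequential accesses land on different edges
        timeline.append(dict(pins))
    return timeline


def setup_edges_before_strobe(timeline):
    """Count the GCLK edges for which data was already stable when strobe rose."""
    strobe_edge = next(i for i, state in enumerate(timeline) if state["strobe"] == 1)
    data_at_strobe = timeline[strobe_edge]["data"]
    first_stable = strobe_edge
    while first_stable > 0 and timeline[first_stable - 1]["data"] == data_at_strobe:
        first_stable -= 1
    return strobe_edge - first_stable


# Write the data, assert the strobe, release the strobe, then write new data.
writes = [{"data": 0xA5}, {"strobe": 1}, {"strobe": 0}, {"data": 0x3C}]
timeline = apply_writes_on_gclk_edges({"data": 0, "strobe": 0}, writes)
setup = setup_edges_before_strobe(timeline)
```

In this model the data is present for one full GCLK period before the strobe rises, so slowing GCLK lengthens the setup time seen by the external device in direct proportion.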
One issue with a slow GCLK period is that the flexible I/O interface itself has limited buffering for MMIO transactions; exceeding this limit can cause performance issues for other transactions on the iMesh. The flexible I/O interface can buffer eight MMIO transactions, so applications that use a slow GCLK frequency and need to send more than eight MMIO transactions without waiting for results must employ some kind of flow control.

Flow control can be achieved by inserting a MF instruction every few MMIO accesses to the flexible I/O interface, which causes the processor to wait until the previous MMIO accesses have completed before issuing further MMIO accesses. Alternatively, the program can perform a MMIO read request and use the returned data, which by definition implies that all previous MMIO accesses have completed. In practice this restriction is rarely severe, because most applications do not require slow GCLK frequencies and would normally perform MMIO read accesses before sequential MMIO write accesses.

14.7 Pin Control and Data Accesses

The direction of the I/O pin interface is configured by programming the GPIO_PIN_DIR_I and GPIO_PIN_DIR_O registers. The supported modes are input-only, output, and open drain. Pins in the output mode can be used as bidirectional pins by disabling (releasing) the driver without changing the mode. Input pins can select either a normal receiver or the Schmitt trigger input in the GPIO_PAD_CONTROL register.

The input state of any pin can be read from the GPIO_PIN_STATE register. The value read can be conditionally inverted, to support active-low signals, by programming GPIO_PIN_INPUT_INV.
The input can also be configured with additional two-level synchronization, which avoids noise sampling problems, in the GPIO_PIN_INPUT_SYNC register.

The output value of a pin can be specified through a number of registers, based on the application requirements. The output value can be conditionally inverted by programming the GPIO_PIN_OUTPUT_INV register. The output driver can be disabled, so that the pin can be used as an input on a bidirectional bus, by writing the GPIO_PIN_RELEASE register. Output pulses with a duration of a single GCLK cycle can be created by writing the GPIO_PIN_PULSE_SET and GPIO_PIN_PULSE_CLR registers.

Normal output data is written using the GPIO_PIN_STATE register, but a few special operations are implemented using additional data registers. These additional operations permit clearing, setting, or toggling the output register. See the GPIO_PIN_xxx register documentation for details, starting at GPIO_PIN_STATE and ending at GPIO_PIN_INPUT_CND.

14.8 Reset/Initialization

At reset, all I/O pins are configured as input pins with the output driver disabled, and GCLK is reset to run at the CORE_REF_CLK frequency.

14.9 Performance

The maximum rate of data transitions on the I/O pins is limited by the GCLK frequency. The maximum GCLK frequency is the maximum frequency of CORE_REF_CLK or half the core clock frequency, depending on the selected clock source.

CHAPTER 15 RSHIM INTERFACES

The Rshim contains chip-level services for boot and debug. It also hosts a number of the low-speed interfaces, such as the UARTs, I2C masters, I2C slave, and SPI.
The Rshim provides a register interface that is accessible from the locally hosted devices, Tile software, and remote devices such as PCIe, JTAG, and USB. Thus the generic boot and debug services hosted by the Rshim are available both to Tile software and to external agents connecting through USB, PCIe, UART, or I2C.

The sections below describe a number of the Rshim's services. Additional services are described within the Rshim register documentation.

15.1 Level-1 Boot

Level-1 boot is achieved by sending boot data on the UDN to a target Tile. The Rshim provides access to the UDN via the packet generator (RSH_PG_CTL and RSH_PG_DATA registers). Thus any device that has access to Rshim registers can provide the level-1 boot stream to the Tiles. The packet generator is reset to a state in which writes to the RSH_PG_DATA register are sent directly onto the UDN.

A 4KB boot buffer provides elasticity to the boot stream. An external agent providing the boot stream can read the SENT_COUNT field in the RSH_PG_CTL register to determine how much boot data has been sent to the Tile, and hence how much space is available in the 4KB boot buffer. By preventing the boot buffer from filling, the external agent can ensure that accesses to other Rshim registers are never blocked, so even a wedged level-1 boot process can be debugged via access to other Rshim registers (RSH_JTAG_CONTROL or RSH_RESET_CONTROL, for example).

15.2 I/O Discovery

The Rshim provides registers that are used by generic I/O device discovery software. These include the RSH_FABRIC_CONN, RSH_FABRIC_DIM, and RSH_IPI_LOC registers. Tile software can read these registers to determine how the mesh is configured and where the various I/O components are attached. Additionally, the RSH_TILE_COL_DISABLE registers indicate which Tiles within the RSH_FABRIC_DIM limits are not available even though their mesh switch is present. The device discovery process is described in more detail in Section 1.1.5 Device Discovery on page 3.
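The boot-buffer flow control described in Section 15.1 can be sketched from the point of view of an external agent. This is a minimal illustration under stated assumptions, not a definitive procedure: the callables `write_pg_data` and `read_sent_count` are hypothetical stand-ins for the agent's register access routines, and the sketch assumes that SENT_COUNT counts 8-byte words and does not wrap during the transfer. Check the RSH_PG_CTL register documentation for the actual units and width.

```python
BOOT_BUFFER_BYTES = 4096   # 4KB boot buffer, per Section 15.1
WORD_BYTES = 8             # assumption: one RSH_PG_DATA write carries 8 bytes
BUFFER_WORDS = BOOT_BUFFER_BYTES // WORD_BYTES


def send_boot_stream(words, write_pg_data, read_sent_count, reserve=1):
    """Send boot words while keeping the boot buffer from filling.

    write_pg_data(word) and read_sent_count() are hypothetical callables
    supplied by the host. reserve is the number of buffer entries that are
    always left free.
    """
    written = 0
    for word in words:
        # Words written but not yet sent to the Tile still occupy the buffer.
        while written - read_sent_count() >= BUFFER_WORDS - reserve:
            pass  # a real host would sleep or apply a timeout here
        write_pg_data(word)
        written += 1
    return written
```

Because the loop never lets the buffer fill, the host's path to the other Rshim registers stays open even if the Tile stops consuming boot data. A production host would replace the busy-wait with a timeout so that a wedged boot can be detected and debugged.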
15.3 tile-monitor FIFOs

The Rshim provides two FIFOs for general-purpose communication between Tile software and an external agent, or "host". The intended use is for one FIFO to be dedicated to host-to-Tile communication and the other to Tile-to-host communication. Each FIFO has associated high-water-mark and low-water-mark interrupts. Although the FIFOs are general purpose, the USB endpoint interface does make assumptions about how they are used when its tile-monitor capability is enabled. The tile-monitor FIFOs are accessible via the RSH_TM_HOST_TO_TILE and RSH_TM_TILE_TO_HOST registers.

15.4 Down-Counters and Watchdog

The Rshim provides three independent 48-bit down-counters that can be connected to interrupts or used as watchdogs to initiate chip resets. The counters can be externally referenced to a real-time clock source or can be timed from REFCLK_CORE. For more information about clock inputs, refer to the appropriate TILE-Gx™ data sheet. The down-counters are contained in the RSH_DOWN_COUNT registers (for example, the RSH_DOWN_COUNT_CONTROL register).

15.5 Rshim JTAG

Rshim JTAG provides a mechanism to access Tilera-specific JTAG registers inside the TILE-Gx™ processor. These registers are typically used to access diagnostic features within the Tile. The Rshim JTAG controls are located in the RSH_JTAG registers (for example, the RSH_JTAG_CONTROL register).

15.6 Reset Control

The TILE-Gx processor can be partially or completely reset using the RSH_RESET_CONTROL, RSH_RESET_MASK, and RSH_IO_RESET registers. These registers allow individual I/O devices to be reset without impacting the rest of the device. Alternatively, an I/O device can be left intact while the rest of the device is reset.
This feature is useful when the part needs to be reset to load new software but a particular I/O port needs to remain "up" while the chip is being reset. The RSH_BREADCRUMB and RSH_SCRATCH_BUF registers retain state during software reset, so they can be used to indicate status to the rebooted system (POST failures, for example).

15.7 Byte Access Interface

Some external hosts that map directly into Rshim register space through PCIe can only support 32-bit accesses. In this case, the RSH_BYTE_ACC registers (for example, the RSH_BYTE_ACC_CTL register) provide an indirect-access mechanism that allows atomic access to the TILE-Gx 64-bit register space. This mechanism is not required for UART, I2C, or USB hosts, since those interfaces define a 64-bit access mechanism through the host's API.

15.8 Remote Interface Access and Device Protection

Devices not directly attached to the Rshim, such as PCIe and USB, use a dedicated on-chip interconnect to access Rshim register space. This allows chip-level boot and debug without interfering with the mesh networks. The remote devices are defined in the RSH_DEVICE_PROTECTION register. This register can be used to block access by remote devices or by locally attached Rshim devices.

CHAPTER 16 UART INTERFACES

16.1 UART Interface

16.1.1 Overview

The UART controller is used for communication between the tiles and an external device (via the two UART serial signals). The UART controller uses a 256-byte transmit FIFO and a 256-byte receive FIFO. Figure 16-1 provides a graphical view of the UART interface.

Figure 16-1: UART Interface Block Diagram (labeled blocks: RefClk, Register Tile/RSHIM Interface, Write FIFO, FSM, TX FIFO, RX FIFO, UART Protocol Controller, Remote UART; labeled signals: serial_tx, serial_rx)

The UART interface can operate in two modes: interrupt mode and protocol mode.
In interrupt mode, the UART interface provides a typical transmit/receive interface between an external device and the on-chip tiles. Data written to the write FIFO by a tile is transferred to an internal transmit FIFO and then transmitted on the serial transmit output. Data received on the serial receive input is transferred to the receive FIFO, which the tile can then read.

In protocol mode, the UART interface gives an external device the ability to read or write any register in any Rshim device. Bytes received on the serial receive input are interpreted as register read or write commands. Read responses are transmitted on the serial transmit output.

16.1.1.1 Protocol Mode

Protocol mode can be enabled either via a bootstrap pin or through the UART configuration registers. To use protocol mode, the UART must be configured to use 8-bit-wide data. A typical use of protocol mode is to provide boot code to the chip by writing to the Rshim STN Data register. Because protocol mode provides access to any register in any Rshim device, it can also be useful for diagnostic purposes.

In protocol mode, the bytes received are interpreted by the UART interface as "segments", where each segment can write to or read from any register in any Rshim device. The format of a segment is shown in Figure 16-2. Each segment has four bytes of header and n bytes of data.

Figure 16-2: Protocol Mode (segment format; bit positions 7, 6, 4, and 0 are marked across the top of the figure)

  Segment 0
    Header row 1    bytes[7:0]
    Header row 2    dest[4:0], bytes[10:8]
    Header row 3    dest[12:5]
    Header row 4    Read, dest[15:13], channel[3:0]
    Data rows       Data 0
  ...
  Segment n         same header layout, followed by Data n

Table 16-1. Protocol Mode Format Descriptions

Bitfield        Description
bytes[10:0]     Specifies the transfer size in bytes (excluding the fixed 4 bytes of header).
                  1    Transfer 1 byte
                  n    Transfer n bytes
                Note: For a read request, the transfer size must be 1 byte or 8 bytes. For a write request, the transfer size must be 1 to 8 bytes (bytes = 1/2/3/4/5/6/7/8). A boot load is a special write request, where the transfer size must be a multiple of 8 bytes.
channel[3:0]    Channel number in the Rshim.
dest[15:0]      Destination address in the Rshim. The lower 3 bits indicate the byte offset.
read            Specifies the direction of the transfer.
                  1    Read request (for example, the external UART reads an address in the Rshim)
                  0    Write request (for example, the external UART writes an address in the Rshim)
data            Data. Note that a read request does not have a data field (read data is transmitted in the opposite direction).

16.1.2 Data Flows

16.1.2.1 Receiving Data

In interrupt mode, the tile reads the UART receive data one byte at a time. The transfer size of each read command is always 1 byte; that is, each read of the UART_RECEIVE_DATA register returns 1 byte of receive data in bits[7:0] of the data bus on the register interface.

There are two usage models for receiving data.
• Interrupt
  The UART_RECEIVE_DATA register can be read when the UARTSH_RFIFO_WE or UARTSH_RFIFO_AFULL interrupt is generated (refer to the UARTSH interrupt status register), which means that receive data is available. Tile software can clear the interrupt bits and then poll the UART_FIFO_COUNT register to determine how many entries are available in the receive FIFO.
• Status Polling
  Software first polls the UART_FIFO_COUNT register. When the receive FIFO is not empty, software issues the read command(s) accordingly.

16.1.2.2 Transmitting Data

The Rshim writes the UART transmit data (UART_TRANSMIT_DATA register), one byte at a time.
The transfer size of each write command is always 1 byte; that is, each write to the UART_TRANSMIT_DATA register delivers 1 byte of write data in bits[7:0] of the data bus on the register interface. Once the write data is written, the transmit data is sent to the external device as long as there is no pending transmit data.

One of two usage models can be used to issue transmit data.
• Interrupt
  Software can issue up to two write (UART_TRANSMIT_DATA register) commands initially. The transmit data register can be written again when the UARTSH_WFIFO_RE interrupt is asserted, which means that one transmit data entry has been consumed (by the transmit FIFO). Software can clear the interrupt bit and then poll the UART_FIFO_COUNT register to determine how many write (UART_TRANSMIT_DATA register) commands to issue.
• Status Polling
  Software polls the UART_FIFO_COUNT register. When the write FIFO is not full, software issues the write command(s) accordingly.

16.1.3 Flow Control

• Hardware CTS (clear to send)/RTS (request to send) flow control is NOT supported by the UART interface.
• XON/XOFF style flow control can be implemented in higher-level software.

16.1.4 Master Arbitration

All registers in the UART Interface can be accessed by both tiles and an external device. However, there are two exceptions to normal register read/write access.
• The UART_ELECTRICAL_CONTROL register is not writable by the external UART (the write request is simply dropped). This is the only register that is writable by a tile but not by an external device.
• The UART_RECEIVE_DATA register can be read by an external device. However, the data entry is not popped from the receive FIFO (that is, it remains in the receive FIFO).
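For an external device that drives protocol mode, the segment format from Section 16.1.1.1 (Figure 16-2 and Table 16-1) can be sketched as a small host-side packing routine. This is an illustration under assumptions, not a reference encoder: the function names are hypothetical, the four header rows of Figure 16-2 are assumed to be sent in the order drawn, and within the shared bytes the fields are assumed to be packed from bit 7 downward in the order listed for segment 0 (Read in bit 7, dest[15:13] in bits 6:4, channel[3:0] in bits 3:0).

```python
def pack_segment_header(num_bytes, dest, channel, read):
    """Pack the 4-byte protocol-mode header (assumed layout, see lead-in)."""
    if not 0 <= num_bytes < (1 << 11):
        raise ValueError("bytes field is 11 bits wide")
    if not 0 <= dest < (1 << 16):
        raise ValueError("dest field is 16 bits wide")
    if not 0 <= channel < (1 << 4):
        raise ValueError("channel field is 4 bits wide")
    row1 = num_bytes & 0xFF                                  # bytes[7:0]
    row2 = ((dest & 0x1F) << 3) | ((num_bytes >> 8) & 0x7)   # dest[4:0], bytes[10:8]
    row3 = (dest >> 5) & 0xFF                                # dest[12:5]
    row4 = (int(bool(read)) << 7) | (((dest >> 13) & 0x7) << 4) | channel
    return bytes([row1, row2, row3, row4])


def build_write_segment(dest, channel, payload):
    """Build a register write segment; Table 16-1 allows 1 to 8 data bytes."""
    if not 1 <= len(payload) <= 8:
        raise ValueError("write transfer size must be 1 to 8 bytes")
    header = pack_segment_header(len(payload), dest, channel, read=False)
    return header + bytes(payload)


def build_read_segment(dest, channel, num_bytes):
    """Build a register read segment; Table 16-1 allows 1 or 8 bytes."""
    if num_bytes not in (1, 8):
        raise ValueError("read transfer size must be 1 or 8 bytes")
    return pack_segment_header(num_bytes, dest, channel, read=True)
```

A read segment carries no data field, so it is exactly four bytes long; the read data comes back on the serial transmit output. A boot load would use the same header with a transfer size that is a multiple of 8 bytes.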
If both the external UART and the tiles want to write to the same register, the write request from the external UART has higher priority. The write request from the tiles is queued and served later.

16.1.5 8/64 Bits Handling

In protocol mode, there is 8/64-bit handling logic. The remote UART is byte oriented, whereas registers in the UART Shim and the Rshim are word oriented.

16.1.5.1 Remote UART Writes

Because a byte mask is not available, remote UART write requests should be performed in 8-byte quantities, starting from byte address 0 (that is, in the sequence byte 0, byte 1, byte 2, byte 3, ..., byte 7). A partial write is filled with 0s; for example, if writes are performed to byte 1 and byte 2 only, then the rest of the bytes are filled with 0s.

An unaligned dest[2:0] is supported, but the ending address of a write should not cross the 8-byte boundary. If it does, the write is considered illegal and the write data is written to a wrapping address. For example, if dest[2:0]=1 and bytes=8, then the first seven bytes of UART write data are written to byte address 1, byte address 2, byte address 3, ..., byte address 7. The eighth byte of UART write data is written to byte address 0 (the wrapping address).

16.1.5.2 Remote UART Reads

In protocol mode, the remote UART must perform read requests either one byte or eight bytes at a time. An unaligned dest[2:0] is supported. Some read requests may have a side effect; for example, a read request may trigger a FIFO "pop" operation. As such, the bytes field must be configured accordingly.

16.1.6 Error Handling and Interrupts

When error conditions occur, they are logged in the UART_INTERRUPT_STATUS register.
When this occurs, an interrupt can be sent to a tile, depending on the mask setting (in the UART_INTERRUPT_MASK register) and the credit state of the binding being used. The UART interface uses the four shared Rshim bindings to determine the destination tile for interrupts, and each interrupt can be programmed to use any of those bindings. See the UART interrupt register (UART_MODE register) definitions for more information about the specific errors that are handled.

16.1.7 UART Controller Registers

For detailed descriptions of the UART registers, refer to uart.html.

CHAPTER 17 I2C MASTER INTERFACE

17.1 Overview

The I2C Master Interface provides an interface for tiles to write and read external I2C devices. It also includes a hardware state machine that can read from an external EEPROM at boot and then write to any Rshim device register.

External I2C devices can be classified into the following three groups:
• Generic I2C port (for example: on a controller)
• Bootable I2C-based serial EEPROM (for example: boot ROM)
• Non-bootable I2C-based serial EEPROM (for example: the Serial Presence Detect (SPD) EEPROM on a DDR2 DIMM)

The I2C Master Interface supports serial EEPROMs of up to 1M bits. If the serial EEPROM is used as a boot ROM, then its size must be in the range of 32K bits to 1M bits; that is, the EEPROM must have a 16-bit word address in addition to the 7-bit device ID. The I2C Master Interface does not support clock stretching.

Figure 17-1: I2C Master Interface Block Diagram (labeled blocks: rshim Register Interface, rshim Host Interface, Write FIFO (wfifo), Read FIFO (rfifo), Buffer FIFO (bsfifo), FSM, I2C Interface, External I2C Devices (for example: Serial EEPROM); labeled clocks: 125MHz on the rshim side, 100 to 400KHz on the I2C side)

17.1.1 I2C Master Boot Options

The following boot options are available:
• Boot request type. Both hard boot requests and soft boot requests (via a special boot instruction) are supported.
  The hard boot request is typically asserted after power-on reset, once the system is ready (such as when a clock is stable).
• Boot ROM type. A bootstrap pin determines whether the chip boots from SPI-based flash or I2C-based EEPROM. The software boot instruction determines the boot type afterwards.
• Boot code destination. A boot code segment can be sent to different destinations. For example, the first segment can be sent to a UART device in the Rshim, the second segment can be sent to an I2C device in the Rshim, and the rest of the segments can be sent to tiles.

17.1.2 Boot ROM Format

Figure 17-2 presents the boot ROM format, which comprises three regions:
• An eight-byte ROM header.
• One or more program segments. Each program segment has an eight-byte segment header followed by program code (in eight-byte quantities).
• A user data region.

Figure 17-2: I2C Master Boot ROM Format (ROM header containing rsvd0 and rev_id; segments 0 through n, each with a segment header containing e, rsvd1, dest, rsvd1, and word, followed by its Boot Code words; then the Data region)

Table 17-1. I2C Master Boot ROM Format

Field           Description
rev_id[7:0]     Revision of the boot ROM. This ID is stored in the Boot Revision ID (I2CM_BOOT_REVISION_ID) register after the boot.
rsvd0[55:0]     Reserved.
rsvd1[14:0]     Reserved.
word[16:0]      Specifies the number of (8B) words in the segment, including the segment header (e, dest, word).
                  1    1 word = 64 bits
                  n    n words = 64 × n bits
dest[16:0]      Specifies the destination of the segment.
                  [16:13]    channel number
                  [12:0]     word address number
e               1    Specifies that this segment is the end of the boot code.
Boot Code       Only the boot code portion is forwarded to the specified destination. The format of the boot code is transparent to the I2C Master Interface, which does not interpret the boot code.
Data            The user data section of the ROM is not processed by the controller.

17.1.3 Boot Operations

A boot sequence starts when a hard or soft boot request is received by the I2C Master Interface. The I2C Master Interface always boots from address 0 in the boot ROM. It first reads the boot header, and then reads boot code segments until the last segment is finished. Only the program portion is forwarded to the destination. The ROM header and the segment header(s) are not forwarded; they are used by the I2C Master Interface.

A bootstrap pin, CONFIG_I2C[0], selects the I2C device address from which to boot. This feature can be used when the chip boots from an I2C boot ROM that has an address conflict with the SPD ROM on the DDR3 DIMMs.
• CONFIG_I2C[0] = 1: boot from the ROM located at device address 1010_100.
• CONFIG_I2C[0] = 0: boot from the ROM located at device address 1010_000.

Note: For more information about the CONFIG_I2C[0] pin, refer to the appropriate data sheet for your processor.

17.2 Usage Model

17.2.1 Generic Operation

Generic read and write operations are supported on the I2C interface. The following figures show the read and write operations on the I2C interface.
Figure 17-3: 8-Bit Address Read (see note 1)
Figure 17-4: 16-Bit Address Read
Figure 17-5: Read without Address
Figure 17-6: 8-Bit Address Write (see note 1)
Figure 17-7: 16-Bit Address Write
Figure 17-8: Write without Address

(Bus sequence diagrams. Labeled elements: START, SLAVE ADDRESS with R/Wn, ACK, WORD ADDRESS [15:8] and WORD ADDRESS [7:0] where an address is sent, DUMMY WRITE and REPEATED START for addressed reads, DATA1 through DATAn with MSB and LSB marked, NACK on reads, NOACK on writes, and STOP.)

Note 1. Refer to I2CM_NO_STOP for more details.

17.2.2 Software Instructions

The I2C master interface supports the following instructions:

Table 17-2. Supported Instructions

Instruction     Description     Instruction format
BOOT            Boot            10
READ            Read            00
WRITE           Write           01

Software Commands to Execute I2C EEPROM Instructions

There are three steps to executing a read or write operation listed in Table 17-2:
1. Program the I2C Address (I2CM_ADDRESS) register and the I2C Byte (I2CM_BYTE) register.
2. Program the I2C Instruction (I2CM_INSTRUCTION) register to start the operation.
3. Start data handling by either writing to the I2C Write Data (I2CM_WRITE_DATA) register or reading from the I2C Read Data (I2CM_READ_DATA) register.

Before programming the I2C Instruction (I2CM_INSTRUCTION) register, software needs to make sure that the I2C interface is idle. There are two usage models:
• Interrupt
  When the I2C_INST_EXEC interrupt (from the I2CM_INT_VEC_W1TC register) is asserted, the previous instruction is done. Software first clears the I2C_INST_EXEC interrupt bit, and then writes to the I2C Instruction (I2CM_INSTRUCTION) register.
• Status Polling
  Software can poll the I2C_BUSY bit of the I2CM_FLAG register, and write to the I2C Instruction (I2CM_INSTRUCTION) register when I2C_BUSY=0.

Write data is supplied by writing to the I2C Write Data (I2CM_WRITE_DATA) register. Each write pushes one entry to the write FIFO. Before writing to the I2C Write Data (I2CM_WRITE_DATA) register, software makes sure that the write FIFO is not full. There are two usage models:
• Interrupt
  When the I2C_WFIFO_READ interrupt (of the I2CM_INT_VEC_W1TC register) is asserted, previous write data has been processed. Software first clears the I2C_WFIFO_READ interrupt bit. Software then writes to the I2C Write Data (I2CM_WRITE_DATA) register based on the status of the write FIFO (I2CM_FLAG register).
  For example, the write FIFO is full if I2C_WFIFO_FULL=1; the write FIFO has one data entry if I2C_WFIFO_FULL=0 and I2C_WFIFO_EMPTY=0; the write FIFO is empty if I2C_WFIFO_EMPTY=1.
• Status Polling
  Software can poll the I2CM_FLAG register, and write to the I2C Write Data (I2CM_WRITE_DATA) register when the write FIFO is not full.

Read data is returned by reading the I2C Read Data (I2CM_READ_DATA) register. Each read returns (pops) one entry from the read FIFO. Before reading from the I2C Read Data register, software makes sure that the read FIFO is not empty. There are two usage models:
• Interrupt
  When the I2C_RFIFO_WRITE interrupt is asserted, new read data is available. Software first clears the I2C_RFIFO_WRITE interrupt bit. Software then reads the I2C Read Data (I2CM_READ_DATA) register based on the status of the read FIFO (I2CM_FLAG register).
• Status Polling
  Software can poll the I2CM_FLAG register, and read from the I2C Read Data (I2CM_READ_DATA) register when the read FIFO is not empty.

Table 17-1 presents examples of how an I2C instruction is implemented by a sequence of software reads and writes. For simplicity, the detailed steps of writing the I2C Instruction register are marked as Write(I2CM_INSTRUCTION). The detailed steps of writing the I2C Write Data (I2CM_WRITE_DATA) register are marked as Write(I2CM_WRITE_DATA), with the data in the WDAT field. The detailed steps of reading the I2C Read Data (I2CM_READ_DATA) register are marked as Read(I2CM_READ_DATA), with the data in the RDAT field.

Table 17-1. Examples of How an I2C Instruction is Implemented (for Reads and Writes)

Instruction: WRITE (8/16-bit address) (a)
Software Command Sequence:
• Write(I2CM_ADDRESS) = starting byte address to program (This step can be skipped if ADDR_SEL.ADDR_DIS=1.)
• Write(I2CM_BYTE) = desired transfer size
• Write(I2CM_INSTRUCTION) = 'b01
• Issue multiple Write(I2CM_WRITE_DATA) according to the BYTE. One Write(I2CM_WRITE_DATA) contains eight bytes of the desired write value.
  Note: WDAT[7:0] contains data for byte address 0; WDAT[63:56] contains data for byte address 7.

Instruction: READ (8/16-bit address) (a)
Software Command Sequence:
• Write(I2CM_ADDRESS) = starting byte address to read (This step can be skipped if ADDR_SEL.ADDR_DIS=1.)
• Write(I2CM_BYTE) = desired transfer size
• Write(I2CM_INSTRUCTION) = 'b00
• Issue multiple Read(I2CM_READ_DATA) according to the BYTE. One Read(I2CM_READ_DATA) returns eight bytes of read data.
  Note: I2CM_READ_DATA[7:0] contains data for byte address 0; I2CM_READ_DATA[63:56] contains data for byte address 7.

Instruction: Write (32-bit address)
Software Command Sequence:
Disable the address phase (ADDR_SEL.ADDR_DIS=1).
• Write(I2CM_BYTE) = 4 (that is, 4 bytes of address) + desired transfer size
• Write(I2CM_INSTRUCTION) = 'b01
• Issue multiple Write(I2CM_WRITE_DATA) according to the BYTE. One Write(I2CM_WRITE_DATA) contains eight bytes of the desired write value. The first Write(I2CM_WRITE_DATA) contains the four bytes of address.
  Note: WDAT[7:0] is sent out first.

Instruction: Read (32-bit address)
Software Command Sequence:
Disable the address phase (ADDR_SEL.ADDR_DIS=1).
• Write(I2CM_BYTE) = 4 (that is, 4 bytes of address)
• Write(I2CM_INSTRUCTION) = 'b01 (this initiates the dummy write operation)
• Issue one Write(I2CM_WRITE_DATA), which contains the four bytes of address sent during the dummy write operation. WDAT[7:0] is sent out first.
• Write(I2CM_BYTE) = desired transfer size
• Write(I2CM_INSTRUCTION) = 'b00
• Issue multiple Read(I2CM_READ_DATA) according to the BYTE. One Read(I2CM_READ_DATA) returns eight bytes of read data.

a. Refer to the I2CM_ADDR_SEL register for more details on 8/16-bit addresses and address disable.

17.2.3 I2C EEPROM Page Mode

The I2C EEPROM has a concept of page size, which improves write (program) timings.
If the transfer size crosses the page size boundary, the EEPROM typically wraps around the address, and data contents can be overwritten in an implementation-dependent way. The I2C Master Interface partitions the transfer into pages if the transfer size (I2CM_BYTE register) crosses the page size (I2CM_PAGE_SIZE register).

17.2.4 Error Handling and Interrupts

It is up to software to maintain the correct boot ROM format, that is, the correct encoding to specify the length of the boot code. It is up to software to follow the correct protocol when executing I2C-based serial EEPROM instructions: I2CM_ADDRESS and I2CM_BYTE must be programmed before I2CM_INSTRUCTION, and EEPROM reads and programs must be in 4B quantities. It is also up to software to issue recognizable instructions.

Error conditions are logged. For more information on how errors are logged, refer to the I2CMS interrupt registers. An interrupt is sent to the bound tile once an error condition is encountered; the binding is defined by the Rshim interrupt binding register (RSH_INT_BIND).

17.3 Registers

For detailed descriptions of the I2C Master registers, refer to i2cm_html.

CHAPTER 18 I2C SLAVE INTERFACE

18.1 Overview

The I2C Slave Interface, illustrated in Figure 18-1, is the interface to an external I2C device, where the external I2C device is the initiator (master) and the I2C Slave Interface is the target (slave). The I2C interface supports Standard-mode, Fast-mode, and Fast-mode Plus. The I2C slave controller supports clock stretching. For more information, refer to the I2C specification.
Figure 18-1: I2C Slave Interface Block Diagram [diagram labels: 125 MHz, 100/400 kHz, Transmit FIFO, Receive FIFO, Buffer FIFO, FSM, TILEs/RSHIM, I2C Interface, External I2C Devices (for example, Bus Controller)]

18.2 Usage Model

18.2.1 Data Flows

An external I2C master device can read and write the address space in the I2C Slave Interface.

An external I2C master device can read and write the address space in a Rshim device (other than the I2C Slave Interface) via the “host interface”. For example, boot code can be pushed to the RSH_PG_DATA register in the Rshim. The I2C Slave controller uses the buffer FIFO to flow control the access to a Rshim device. The buffer FIFO is not software-addressable.

A Rshim device can read and write the address space in the I2C Slave Interface via the “register interface”, where the Rshim is the initiator (master). Note that the Rshim cannot initiate requests to external I2C devices; the Rshim can only respond to external I2C devices.

The I2C Slave Interface implements two software-addressable FIFOs to assist data passing.

• I2C slave interface “pull” model: An external I2C master device can pass data to a tile via the receive FIFO; that is, data is written to the receive FIFO by an external device and read from the receive FIFO by a tile. The receive FIFO is software-addressable. A receive FIFO write event interrupt is provided, along with other receive FIFO status flags.
• I2C slave interface “push” model: A tile can pass data to an external I2C master device via the transmit FIFO; that is, data is pushed to the transmit FIFO by a tile and popped from the transmit FIFO by an external I2C master device. The transmit FIFO is software-addressable. A transmit FIFO read event interrupt is provided, along with other transmit FIFO status flags.
18.2.2 Direct-Addressing

The I2C Slave Interface is configured to have a 7-bit I2C slave address (I2CS_SLAVE_ADDRESS) register. The I2C Slave Interface responds to the bus only if this I2C slave address matches the targeted I2C device. If the 7-bit slave address matches, direct addressing is applied on the 16-bit address by the I2C Slave controller. Bit assignments of the 16-bit address are shown in Table 18-1.

Table 18-1. I2C Slave Bit Assignments

Bits | Description
15:12 | Rshim channel number[3:0]
11:3 | (Word) register number[8:0]
2:0 | Byte address

If the I2C Slave controller is selected (that is, the channel number matches the I2C Slave controller), the total addressable space is 2K bytes.

I2C slave controller addressing = {register number [8:0], byte address [2:0]}

If a Rshim device (other than the I2C slave controller) is selected, the total address space becomes 1M bytes. The register number can be extended by 4 bits via the I2CS_RDEV_ADDR register.

Rshim device addressing = {channel number [3:0], I2CS_RDEV_ADDR[3:0], register number [8:0]}

18.2.3 No-Address Access

The I2C Slave Interface is configured as direct-addressing by default. The address phases can be disabled via the I2CS_ADDR_PHASE register, in which case the I2C Slave Interface is considered an address-less device.

In the “no-address” mode, an external I2C master device will push all data (note that all payload is treated as data, with no address) to the receive FIFO, and an RFIFO_WRITE interrupt will be raised when an entry is written to the receive FIFO. It is up to the TILE-side software to read from the receive FIFO in a timely fashion.

An external I2C master device will read all data from the transmit FIFO, and a TFIFO_READ interrupt will be raised when an entry is read from the transmit FIFO. It is up to the TILE-side software to write to the transmit FIFO in a timely fashion.
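The direct-addressing bit layout of Table 18-1 can be illustrated with a short sketch. The function and type names below are hypothetical helpers introduced only for illustration; the field positions and the Rshim device concatenation follow the text of section 18.2.2, and the result of the concatenation is treated here as a plain 17-bit word index, which is an assumption about how the value would be used.

```python
from typing import NamedTuple


class DirectAddress(NamedTuple):
    """Fields of the 16-bit direct address (Table 18-1)."""
    channel: int   # bits 15:12, Rshim channel number
    register: int  # bits 11:3, (word) register number
    byte: int      # bits 2:0, byte address within the 8-byte word


def decode_direct_address(addr16: int) -> DirectAddress:
    """Split a 16-bit direct address into its three fields (hypothetical helper)."""
    if not 0 <= addr16 <= 0xFFFF:
        raise ValueError("direct address must fit in 16 bits")
    return DirectAddress(
        channel=(addr16 >> 12) & 0xF,
        register=(addr16 >> 3) & 0x1FF,
        byte=addr16 & 0x7,
    )


def rshim_device_word_index(addr16: int, rdev_addr: int) -> int:
    """Form {channel[3:0], I2CS_RDEV_ADDR[3:0], register[8:0]} as one integer.

    Assumption: the concatenation is used as a 17-bit word index.
    """
    if not 0 <= rdev_addr <= 0xF:
        raise ValueError("I2CS_RDEV_ADDR extension is 4 bits")
    fields = decode_direct_address(addr16)
    return (fields.channel << 13) | (rdev_addr << 9) | fields.register
```

With 17 bits of word index and 8 bytes per word, the Rshim device space comes to 2^17 × 8 bytes = 1M bytes, which agrees with the figure quoted in section 18.2.2.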
18.2.4 8 Bits / 64 Bits Handling

The I2C bus is byte-oriented, whereas registers in the I2C Slave Interface and the Rshim are 8-byte word-oriented. Because a byte mask is not available, I2C write operations should be performed in 8-byte quantities, starting from byte address 0 (for example, in the sequence byte 0, byte 1, byte 2, … byte 7). In case of a partial write (where the transfer size is not a multiple of 8 bytes), 0s will be filled in (for example, if writes are performed to byte 0 and byte 1 only, then the other bytes will be filled with 0s). The I2C Slave Interface uses the byte address to assemble multiple 8-bit I2C write data into 64-bit internal write data.

There is no restriction on I2C read operations. The I2C Slave Interface uses the byte address to steer the 64 bits of internal read data and assemble the 8-bit I2C read data.

18.2.5 Acknowledge Control

The I2C Slave Interface supports clock stretching. When the I2C Slave Interface is not able to respond with read data in time, it can stretch the clock. Some external I2C master devices might not support clock stretching.

Normally, the ack/nack indicates whether or not the I2C Slave Interface acknowledges the data. The I2C Slave Interface provides a register to control how the read acknowledgment and the write acknowledgment are handled. Refer to the I2CS_ACK_CTL register for more details.

18.2.6 Access Arbitration

Local registers in the I2C Slave Interface can be read by the Rshim and the external I2C master device at the same time. If the Rshim and the external I2C master device try to write to the I2C Slave Interface at the same time, higher priority is given to the external I2C master device. (Note that requests from a low-speed I2C master device cannot be sustained; the Rshim will be able to write to the I2C Slave Interface shortly afterwards.)
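The byte-to-word handling described in section 18.2.4 can be modeled in a few lines. This is only a behavioral sketch: the function names are hypothetical, and it assumes the same byte ordering that the I2C Master notes give for WDAT (byte address 0 in bits [7:0], byte address 7 in bits [63:56]), which the slave section does not state explicitly.

```python
def assemble_write_word(data: bytes) -> int:
    """Assemble I2C write bytes, starting at byte address 0, into a 64-bit word.

    A partial write (fewer than 8 bytes) is zero-filled, as described for
    the I2C Slave Interface. Assumption: byte address 0 maps to bits [7:0].
    """
    if not 1 <= len(data) <= 8:
        raise ValueError("expected between 1 and 8 bytes")
    padded = data.ljust(8, b"\x00")        # zero-fill the missing upper bytes
    return int.from_bytes(padded, "little")


def steer_read_byte(word: int, byte_address: int) -> int:
    """Select the byte at byte_address (0..7) from a 64-bit internal read word."""
    if not 0 <= byte_address <= 7:
        raise ValueError("byte address must be 0..7")
    return (word >> (8 * byte_address)) & 0xFF
```

The zero fill is why a partial write cannot be used to update a few bytes of a register in place: the untouched bytes are overwritten with 0s rather than preserved.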
18.2.7 Error Handling and Interrupts

Error conditions will be logged; refer to the interrupt status and mask (I2CS_INT_VEC_MASK) registers for how errors are logged. Interrupts will be sent to a binding tile once an error condition is encountered. The interrupt binding is handled by the RSH_INT_BIND register in the Rshim.

CHAPTER 19 SPI INTERFACE

19.1 Overview

The SPI SROM interface provides an interface for tiles to write and read an off-chip SPI SROM. It also includes a hardware state machine that can read from the SPI SROM at boot time and then write any Rshim device register. External SPI SROM sizes between 512K and 128M bits are supported.

Figure 19-1: SPI SROM Interface Block Diagram [diagram labels: 125 MHz, 15.625 MHz, Register Interface (RSHIM), Host Interface (RSHIM), Write FIFO, Read FIFO, Buffer FIFO, FSM, SPI Interface, External SPI ROM]

19.1.1 Boot Options

The following boot options are available.

• Boot request type
  • Both hard boot requests and soft boot requests (via a special boot instruction) are supported. The hard boot request is via a bootstrap pin.
• Boot code destination
  • A boot code segment can be sent to different destinations. For example, the first segment is sent to a UART controller in the Rshim, the second segment is sent to an I2C device in the Rshim, and the rest of the segments are sent to tiles.

19.1.2 Boot ROM Format

Figure 19-2 presents the boot ROM record format. The boot ROM comprises three regions:

• An eight-byte ROM header.
• One or multiple program segments. In each program segment, there is an eight-byte segment header followed by program code (in eight-byte quantities).
• A user data region.
Figure 19-2: SPI Boot ROM Format [diagram labels: header word with fields e, rsvd, rev_id, dest, word; Boot Code 0 (segment 0); segment n header word with fields e, rsvd, dest, word; Boot Code n (segment n); Data]

Table 19-1. SPI SROM Format Description

Field | Description
Rev_id[7:0] | Revision of the boot ROM; it will be stored in a configuration register after the boot.
Rsvd | Reserved.
Word[16:0] | Specifies the number of words (8B) in the segment, including the segment header (e, dest, word). 0: 2^17 words = 2^23 bits. n: n words = 64n bits.
Dest[16:0] | Specifies the destination of the segment. [16:13] channel number; [12:0] word address number.
E | 1: Specifies that this segment is the end of the boot code.
Boot Code | Only the boot code portion will be forwarded to the specified destination. The format of the boot code is transparent to the SROM controller; the SROM controller does not interpret the boot code.
Data | The user data section of the ROM will not be processed by the controller.

19.2 Usage Model

19.2.1 Boot Operation

A boot sequence starts when a hard or soft boot request is received by the SPI SROM controller. The SPI SROM controller always boots from address 0 in the boot ROM. The SPI SROM controller reads the boot header. The SPI SROM controller then reads program segments and processes them until a segment marked as the end segment is reached. The ROM header and the segment header(s) are not forwarded and are only used by the SPI SROM controller.

19.2.2 SPI Flash Operations

19.2.2.1 SPI Flash Instructions

In addition to the boot, the SROM controller supports the following instructions for the SPI-based serial flash, as listed in Table 19-2.

Table 19-2. SPI Flash Instructions

Instruction | Description | Instruction Format | Address Bytes | Dummy Bytes | Data Bytes (write) | Data Bytes (read)
BOOT | Boot | 1 0000 0000 (100h) | 0 | 0 | 0 | 0
WREN | Write enable | 0 0000 0110 (06h) | 0 | 0 | 0 | 0
WRDI | Write disable | 0 0000 0100 (04h) | 0 | 0 | 0 | 0
RDID0 | Read identification | 0 1001 1111 (9fh) | 0 | 0 | 0 | 1 to 3
RDID1 | Read identification | 0 0001 0101 (15h) | 0 | 0 | 0 | 1 to 3
RDSR | Read Status Register | 0 0000 0101 (05h) | 0 | 0 | 0 | 1
WRSR | Write Status Register | 0 0000 0001 (01h) | 0 | 0 | 1 | 0
READ | Read Data Bytes | 0 0000 0011 (03h) | 3 | 0 | 0 | 1 to max
PP | Page Program. 1 to 512 bytes can be programmed. | 0 0000 0010 (02h) | 3 | 0 | 1 to 512 | 0
SE0 | Sector Erase. Erase one sector in memory array. | 0 1101 1000 (d8h) | 3 | 0 | 0 | 0
SE1 | Sector Erase. Erase one sector in memory array. | 0 0101 0010 (52h) | 3 | 0 | 0 | 0
BE0 | Bulk Erase. Erase all sectors in memory array. | 0 1100 0111 (c7h) | 0 | 0 | 0 | 0
BE1 | Bulk Erase. Erase all sectors in memory array. | 0 0110 0010 (62h) | 0 | 0 | 0 | 0
DP | Deep power-down. | 0 1011 1001 (b9h) | 0 | 0 | 0 | 0
RES | Release from deep power-down and Read Electronic Signature | 0 1010 1011 (abh) | 0 | 3 | 0 | 1 to max
RES | Release from deep power-down. | 0 1010 1011 (abh) | 0 | 0 | 0 | 0

19.2.2.2 SPI Configurable Instruction Sets

The SPI SROM controller supports a configurable instruction set. Refer to the configurable Instruction Code registers for more details. Each instruction code defines a sequence of operations associated with the instruction, for example, the number of address bytes, number of dummy bytes, number of write data bytes, and number of read data bytes.

19.2.2.3 SPI Flash Unknown Instruction

For any instruction not defined in the configurable instruction code registers, the SPI SROM controller simply sends the unknown instruction (bit 7 to bit 0) to the external device. Therefore, it is up to the software to carefully manage the instructions.
An interrupt is generated when any unknown instruction is encountered.

19.2.2.4 SPI Flash Deep Power-Down

The external SPI SROM can be put into deep power-down mode (by executing the DP instruction). Once it is in deep power-down, all instructions except the RES instruction are ignored. There is no status bit to determine whether or not the current instruction is ignored. It is up to the software to detect whether or not the SROM is in the deep power-down state.

19.2.2.5 SPI Flash Write In-Progress

When the external SROM is not in deep power-down mode, all attempts (except RDSR) to access the memory array during a WRSR cycle, PP cycle, or Erase cycle (SE or BE) are ignored, and the internal WRSR cycle, PP cycle, or Erase cycle continues unaffected.

A status register tracks the activities of the external SROM. This register can be read at any time, even while a program, erase, or write status register cycle is in progress. It is strongly recommended to check the Write In Progress (WIP) bit before sending a new instruction to the SROM device.

Bits | Name | Description
7 | SRWD | Status register write disable bit.
6 | Reserved | Read as 0
5 | Reserved | Read as 0
4:2 | BP2, BP1, BP0 | Block Protect bits. These bits define the size of the area to be software-protected against Program and Erase instructions.
1 | WEL | Write Enable Latch. When this bit is set to 0, the Write Status Register, Program, or Erase instructions are ignored.
0 | WIP | Write In-Progress bit. This bit indicates if the memory is busy with a Write Status Register, Program, or Erase cycle. When read as 1, one of these cycles is in progress.

19.2.2.6 SPI Flash Write Protection

SPI Flash has several protection schemes: the write protection pin (this is an off-chip function, for example a jumper on the board), status register protection (SRWD), block protect bits (BP), and the write enable latch (WEL). Refer to the memory vendor datasheet for more details.
The software manages the write protection (for example, WEL must be enabled before a program or erase instruction can be executed, and the target area must not be protected by the BP bits).

19.2.2.7 SPI Flash Page Mode

For optimized timings, it is recommended to use the PP instruction to program all consecutive targeted bytes in a single sequence (up to the page size) rather than using several PP sequences that each contain only a few bytes. Note that the transfer size should not cross the page size boundary.

19.2.2.8 SPI Flash Interface

The SPI interface provides four external pins. Only SPI Mode 0 clocking is supported.

• spi_clk (SROM_SPI_SCK) (15.625 MHz serial clock)
• SROM_SPI_CS_N (chip select)
• SROM_SPI_MOSI (master out, slave in; to serial flash)
• SROM_SPI_MISO (master in, slave out; from serial flash)

19.2.2.9 Software Command Sequences to Execute an SPI Flash Instruction

In general, there are three steps involved in executing an SPI Flash instruction in Table 19-1. Step 1 and/or step 3 might not be necessary for certain SPI Flash instructions.

1. Program the SROM SPI Address (SROM_ADDRESS) register and program the SROM SPI Byte register.
2. Program the SROM SPI Instruction (SROM_INSTRUCTION) register to start the operation.
3. Handle data by either writing to the SROM SPI Write Data (SROM_WRITE_DATA) register or reading from the SROM SPI Read Data (SROM_READ_DATA) register.

Before programming the SROM SPI Instruction (SROM_INSTRUCTION) register, software must ensure that the SROM controller is idle. There are two usage models:

• Interrupt: When the INST_EXEC interrupt (in the SROM_INT_VEC_W1TC register) is asserted, the previous instruction is done. Software first clears the INST_EXEC interrupt bit, and then writes to the SROM SPI Instruction (SROM_INSTRUCTION) register.
• Status Polling: Software can poll the BUSY bit of the SROM_FLAG register, and write to the SROM SPI Instruction (SROM_INSTRUCTION) register when BUSY=0.

Several instructions, including WRSR and PP, involve one or multiple bytes of write data. All types of write data are programmed by writing the SROM SPI Write Data (SROM_WRITE_DATA) register. Each write pushes one entry to the write FIFO.

To write to the SROM SPI Write Data register, software must make sure that the write FIFO is not full. There are two usage models:

• Interrupt: When the WFIFO_READ interrupt (of the SROM_INT_VEC_W1TC register) is asserted, the previous write data has been processed. Software first clears the WFIFO_READ interrupt bit. Software then writes to the SROM SPI Write Data (SROM_WRITE_DATA) register, based on the status of the write FIFO (SROM_FLAG register). For example, the write FIFO is full if WFIFO_FULL=1; the write FIFO has one data entry if WFIFO_FULL=0 and WFIFO_EMPTY=0; the write FIFO is empty if WFIFO_EMPTY=1.
• Status Polling: Software can poll the SROM_FLAG register, and write to the SROM SPI Write Data (SROM_WRITE_DATA) register when the write FIFO is not full.

Several instructions, including RDSR, RDID0/RDID1, READ, and RES, involve one or multiple bytes of read data. All types of read data are returned by reading the SROM SPI Read Data (SROM_READ_DATA) register. Each read returns (pops) one entry from the read FIFO.

To read from the SROM SPI Read Data (SROM_READ_DATA) register, software must ensure that the read FIFO is not empty. There are two usage models:

• Interrupt: When the RFIFO_WRITE interrupt is asserted, new read data is available. Software first clears the RFIFO_WRITE interrupt bit. Software then reads the SROM SPI Read Data register based on the status of the read FIFO (SROM_FLAG register).
• Status Polling: Software can poll the SROM_FLAG register and read from the SROM SPI Read Data (SROM_READ_DATA) register when the read FIFO is not empty.

Table 19-1 shows examples of how an SPI Flash instruction is implemented by a sequence of software reads and writes. For simplicity, the detailed steps of how to write the SROM SPI Instruction (SROM_INSTRUCTION) register are marked as Write(SROM_INSTRUCTION). The detailed steps of how to write the SROM SPI Write Data (SROM_WRITE_DATA) register are marked as Write(SROM_WRITE_DATA). The detailed steps of how to read the SROM SPI Read Data (SROM_READ_DATA) register are marked as Read(SROM_READ_DATA).

Table 19-1. SPI Flash Implementation Instructions

RDSR
• Write(SROM_INSTRUCTION) = 05h
• Read(SROM_READ_DATA) returns the status register once.
Note: Status returns on SROM_READ_DATA[7:0].

WRSR
• Poll the SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Make sure write is enabled (for example, execute the WREN instruction).
• Write(SROM_INSTRUCTION) = 01h
• Write(SROM_WRITE_DATA) = desired status setting.
Note: SROM_WRITE_DATA[7:0] contains the status setting.

WREN
• Poll the SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Write(SROM_INSTRUCTION) = 06h

WRDI
• Poll the SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Write(SROM_INSTRUCTION) = 04h

RDID0/RDID1
• Poll the SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Write(SROM_BYTE) = 1/2/3
• Write(SROM_INSTRUCTION) = 9fh or 15h
• Read(SROM_READ_DATA) returns RDID.
Note: Typically RDID has one byte of manufacturer identification (stored in SROM_READ_DATA[23:16]) and two bytes of device identification, for example memory type and memory density (stored in SROM_READ_DATA[15:0]).

READ
• Poll the SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Write(SROM_ADDRESS) = starting byte address to read
• Write(SROM_BYTE) = desired transfer size (must be in 8B quantities).
• Write(SROM_INSTRUCTION) = 03h
• Perform multiple Read(SROM_READ_DATA) according to the SROM_BYTE. One Read(SROM_READ_DATA) returns eight bytes of read data.
Note: SROM_READ_DATA[7:0] contains data for the lowest address; SROM_READ_DATA[63:56] contains data for the highest address.

PP
• Poll the SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Make sure write is enabled and the targeted area is not protected (for example, execute the WREN instruction or WRSR instruction).
• Write(SROM_ADDRESS) = starting byte address to program
• Write(SROM_BYTE) = desired transfer size (must be in 8B quantities and should not cross the page size boundary specified in the SROM hardware specification).
• Write(SROM_INSTRUCTION) = 02h
• Perform multiple Write(SROM_WRITE_DATA) according to the SROM_BYTE. One Write(SROM_WRITE_DATA) contains eight bytes of the desired write value.
Note: SROM_WRITE_DATA[7:0] contains data for the lowest address; SROM_WRITE_DATA[63:56] contains data for the highest address.

SE0/SE1
• Poll the SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Make sure write is enabled and the targeted area is not protected (for example, execute the WREN instruction or WRSR instruction).
• Write(SROM_ADDRESS) = any byte address within the sector to be erased.
• Write(SROM_INSTRUCTION) = d8h or 52h.

BE0/BE1
• Poll the SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Make sure write is enabled and the targeted area is not protected (for example, execute the WREN instruction or WRSR instruction).
• Write(SROM_INSTRUCTION) = c7h or 62h

DP
• Poll the SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Write(SROM_INSTRUCTION) = b9h

RES
• Write(SROM_BYTE) = 0/3 (3 for read electronic signature)
• Write(SROM_INSTRUCTION) = abh
• Read(SROM_READ_DATA) returns the electronic signature, if desired.
Note: SROM_READ_DATA[7:0] contains the electronic signature.

19.2.2.10 Interface Timing

Table 19-2 lists the timing arcs that have been implemented by hardware (assuming a 125 MHz reference clock). The program and erase operations may take a long time, and the WIP bit is used to check whether or not the operation has finished.

Table 19-2. Timing Arcs Implemented by Hardware

Symbol | Timing Parameter | Hardware Implemented Value
tSHSL | Chip deselect time | 240 ns
tDP | Chip high to Deep Power-down mode | 6.4 us
tRES1 | Chip high to standby power mode without Electronic signature read | 64 us
tRES2 | Chip high to standby power mode without Electronic signature read | 64 us
tSLCH | Chip select active setup time | 44 ns
tCHSL | Chip select not active hold time | 44 ns
tDVCH | Master out and slave in setup time | 12 ns
tCHDX | Master out and slave in hold time | 12 ns
tCHSH | Chip select active hold time | 44 ns
tSHCH | Chip select not active setup time | 44 ns

Write protection is supported via the status registers. External SROM devices can support write protection and hold pins, which are supported by board functions.

19.2.3 Rshim Interface

19.2.3.1 Rshim Register Interface

The SROM controller is accessed by a register interface. A write will be acknowledged once the previous write is consumed. A read will be acknowledged once read data is returned from the external SROM. If there is no read data to be returned, the read will be acknowledged immediately (with 0s), and an interrupt will be generated.
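The READ command sequence from the SPI Flash Implementation Instructions table can be sketched as follows. This is a minimal illustration under stated assumptions, not driver code: `FakeSrom` is a hypothetical in-memory stand-in for the SROM register block, the `write`/`read` accessor methods are invented for the example, and the BUSY and read-FIFO-empty checks on SROM_FLAG described in section 19.2.2.9 are omitted because the fake device completes every instruction immediately.

```python
import struct

RDSR, READ = 0x05, 0x03  # opcodes from the SPI Flash Instructions table
WIP = 0x01               # Write In-Progress, status register bit 0


class FakeSrom:
    """Hypothetical stand-in for the SROM registers, backed by a byte image."""

    def __init__(self, image: bytes):
        self.image = image
        self.regs = {"SROM_ADDRESS": 0, "SROM_BYTE": 0}
        self.rfifo = []  # models the read FIFO, one 64-bit entry per element

    def write(self, name: str, value: int) -> None:
        if name != "SROM_INSTRUCTION":
            self.regs[name] = value
        elif value == RDSR:
            self.rfifo.append(0)  # status byte with WIP clear
        elif value == READ:
            start, size = self.regs["SROM_ADDRESS"], self.regs["SROM_BYTE"]
            chunk = self.image[start:start + size]
            # One FIFO entry per 8 bytes; lowest address lands in bits [7:0].
            self.rfifo += [w for (w,) in struct.iter_unpack("<Q", chunk)]

    def read(self, name: str) -> int:
        assert name == "SROM_READ_DATA"
        return self.rfifo.pop(0)


def wait_write_done(dev) -> None:
    """Issue RDSR until the WIP bit reads as 0."""
    while True:
        dev.write("SROM_INSTRUCTION", RDSR)
        if not dev.read("SROM_READ_DATA") & WIP:
            return


def srom_read(dev, address: int, size: int) -> bytes:
    """Follow the READ row: poll WIP, set address and size, start, drain data."""
    if size <= 0 or size % 8:
        raise ValueError("transfer size must be a positive multiple of 8 bytes")
    wait_write_done(dev)
    dev.write("SROM_ADDRESS", address)
    dev.write("SROM_BYTE", size)
    dev.write("SROM_INSTRUCTION", READ)
    words = [dev.read("SROM_READ_DATA") for _ in range(size // 8)]
    return b"".join(struct.pack("<Q", w) for w in words)
```

On real hardware each register access would additionally be gated on the SROM_FLAG status (or the corresponding interrupts), since a read of an empty read FIFO is acknowledged with 0s and raises an interrupt.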
19.2.3.2 Rshim Host Interface

The host interface is used during the boot sequence. Once the boot code is read from the external SROM, it is first stored in a small boot FIFO. The host interface assembles the channel number and register number together with the boot code. A simple busy and acknowledge protocol is applied between this interface and the Rshim on the other side.

19.2.3.3 Error Handling and Interrupts

It is up to software to maintain the correct boot ROM format, that is, the correct encoding to specify the length of the boot code. It is up to software to maintain the correct protocol in order to execute the SPI-based serial flash instructions; that is, program SROM_ADDRESS/SROM_BYTE before programming SROM_INSTRUCTION, and read and program the SROM in 8B quantities. It is up to software to issue recognizable instructions.

Error conditions will be logged; refer to the SROM interrupt registers for how errors are logged. Interrupts will be sent to a binding tile once an error condition is encountered; the binding is defined by the Rshim interrupt binding (RSH_INT_BIND) register.

APPENDIX A: JTAG INTERFACE

The TILE-Gx processor supports standard boundary scan and is 1149.1-compliant. A BSDL file is available. In addition to standard 1149.1 registers, the TILE-Gx JTAG interface provides instructions to read and write a number of internal data registers distributed throughout the processor.

Access to the JTAG controller is shared between the JTAG I/O pins and an internal JTAG controller. In order to enable access from the JTAG I/O pins, the RJC pin must be deasserted. For more information, refer to the TILE-Gx36 Data Sheet (DS400).

Most of the JTAG registers are for test purposes only; however, the Rshim data register supports read and write transactions to the Rshim MMIO registers. The JTAG instruction for Rshim access is 27 bits long and has the value 0x02C009A.
Writing the Rshim access data register can initiate a read or write operation to the Rshim MMIO registers. Reading the Rshim access data register supplies status and data from the previous read or write operation.

Data written into the Rshim data register has the following format:

Table A-1. Write Data Format

Bits | Function | Description
1:0 | CMD | 0: Nop. 1: Read. 2: Write. 3: Reserved.
65:2 | DATA | Data to be written if write operation.
78:66 | REG | Offset of register to be accessed.
82:79 | CHAN | Channel to be accessed.

The data read from the Rshim data register has the following format:

Table A-2. Read Data Format

Bits | Function | Description
1:0 | STATUS | 0: Read/Write complete. 3: Read/Write incomplete.
65:2 | DATA | Register data if the operation was a read and has completed.
85:66 | PAD | Pad data.

The STATUS field of the data register can be used to determine if the Rshim access function is idle or if an operation is being performed at that time.

Read operations are accomplished by shifting the READ command into the Rshim data register and updating it. The Rshim data register can then capture the read data, which can be shifted out. In the unlikely event that the read transaction has not completed by the time the data is captured, the status will be reported as 3, and another read can be attempted.

Write operations can poll the STATUS field to ensure that previous operations are complete before starting the next operation.
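The bit layouts in Tables A-1 and A-2 can be expressed as a pair of packing helpers. This is a sketch for illustration only: the function names and constants are hypothetical, and the values are handled as plain Python integers, leaving out how the bits are actually shifted through the JTAG data register.

```python
from typing import Tuple

CMD_NOP, CMD_READ, CMD_WRITE = 0, 1, 2        # CMD field values (Table A-1)
STATUS_COMPLETE, STATUS_INCOMPLETE = 0, 3     # STATUS field values (Table A-2)
MASK64 = (1 << 64) - 1


def pack_rshim_access(cmd: int, chan: int, reg: int, data: int = 0) -> int:
    """Build the value shifted into the Rshim data register (Table A-1)."""
    if cmd not in (CMD_NOP, CMD_READ, CMD_WRITE):
        raise ValueError("CMD must be 0 (Nop), 1 (Read) or 2 (Write)")
    if not 0 <= chan <= 0xF:
        raise ValueError("CHAN is 4 bits (82:79)")
    if not 0 <= reg <= 0x1FFF:
        raise ValueError("REG is 13 bits (78:66)")
    if not 0 <= data <= MASK64:
        raise ValueError("DATA is 64 bits (65:2)")
    return cmd | (data << 2) | (reg << 66) | (chan << 79)


def unpack_rshim_response(raw: int) -> Tuple[int, int]:
    """Split a captured value into (STATUS, DATA) per Table A-2."""
    return raw & 0x3, (raw >> 2) & MASK64
```

A read would then be issued by shifting in `pack_rshim_access(CMD_READ, chan, reg)`, capturing the register again, and retrying while the unpacked status equals `STATUS_INCOMPLETE`.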
APPENDIX B: CLASSIFIER INSTRUCTIONS AND SPRS

Figure B-1: Classifier Instruction Format [diagram of the 32-bit instruction word, bits 31 to 0; format labels: Register Op, Immediate Op, mfspr/mtspr, Branch, Jump; field labels: opcode, destReg, srcA, srcB, immediate, spr_idx, target, Reserved]

Table B-1. destReg, srcA, and srcB Encodings

Register Number | Name/Description (As Destination, As Source a)
21:0 | GPRs. General Purpose Registers.
22 | hPtr
23 | tbl[tPtr++(1)] (mem1) pdPtr
24 | tbl[tPtr++(2)] (mem2) tPtr
25 | iHdr[hPtr] (peek2)
26 | iHdr[hPtr++(2)] (get2)
27 | Hash0_lo
28 | Hash0_hi
29 | Hash1_lo
30 | Hash1_hi
31 | NULL
pDesc[pdPtr++] (put2, inc1) pDesc[pdPtr++(2)] (put2, inc2)

a. When only one description is provided, the meaning is the same for Source and Destination.

B.1 Classifier Instructions

The classifier instructions, listed below, are defined in the following sections:

• Arithmetic Instructions
• Comparison Instructions
• Control Instructions
• Logical Instructions
• Miscellaneous Instructions

B.1.1 Arithmetic Instructions

Table B-2. Arithmetic Instructions

Instruction | Example | Opcode | Function | Description
ADD | add r3,r1,r2 | 62 | rf[Dest] = rf[SrcA] + rf[SrcB]; | Add two words together.
ADDI | addi r3,r1,-5 | 63 | rf[Dest] = rf[SrcA] + Imm16; | Add the contents of a register and an immediate.
CSUM | csum r3,r1,r2 | 61 | rf[Dest] = csum(rf[SrcA], rf[SrcB]); | Compute the checksum of two words.
CSUM_HASH0_ACC2 | csum_hash0_acc2 r3,r1,r2 | 48 | {unsigned int tmpA = rf[SrcA], tmpB = rf[SrcB]; Accum0 = crc32_16(Accum0, rf[SrcB]); rf[Dest] = csum(tmpA, tmpB);} | Compute the checksum of two words and accumulate the CRC of one of the words.
CSUM_HASH0_SEED2 | csum_hash0_seed2 r3,r1,r2 | 52 | {unsigned int tmpA = rf[SrcA], tmpB = rf[SrcB]; Accum0 = crc32_16(0xffffffff, rf[SrcB]); rf[Dest] = csum(tmpA, tmpB);} | Compute the checksum of two words and CRC of one of the words with a seed into the accumulator.
CSUM_HASH1_ACC2 | csum_hash1_acc2 r3,r1,r2 | 49 | {unsigned int tmpA = rf[SrcA], tmpB = rf[SrcB]; Accum1 = crc32_16(Accum1, rf[SrcB]); rf[Dest] = csum(tmpA, tmpB);} | Compute the checksum of two words and accumulate the CRC of one of the words.
CSUM_HASH1_SEED2 | csum_hash1_seed2 r3,r1,r2 | 53 | {unsigned int tmpA = rf[SrcA], tmpB = rf[SrcB]; Accum1 = crc32_16(0xffffffff, rf[SrcB]); rf[Dest] = csum(tmpA, tmpB);} | Compute the checksum of two words and CRC of one of the words with a seed into the accumulator.
CSUM_HASH16_0 | csum_hash16_0 r3,r1,r2 | 48 | {unsigned int tmpA = rf[SrcA], tmpB = rf[SrcB]; Accum0 = crc32_16(Accum0, rf[SrcB]); rf[Dest] = csum(tmpA, tmpB);} | Compute the checksum of two words and accumulate the CRC of one of the words.
CSUM_HASH16_1 | csum_hash16_1 r3,r1,r2 | 49 | {unsigned int tmpA = rf[SrcA], tmpB = rf[SrcB]; Accum1 = crc32_16(Accum1, rf[SrcB]); rf[Dest] = csum(tmpA, tmpB);} | Compute the checksum of two words and accumulate the CRC of one of the words.
CSUM_HASH8_0 | csum_hash8_0 r3,r1,r2 | 50 | {unsigned int tmpA = rf[SrcA], tmpB = rf[SrcB]; Accum0 = crc32_8(Accum0, (rf[SrcB] & 0xff)); rf[Dest] = csum(tmpA, (tmpB & 0xff));} | Compute the checksum of two words and accumulate the CRC of one of the words.
CSUM_HASH8_1 | csum_hash8_1 r3,r1,r2 | 51 | {unsigned int tmpA = rf[SrcA], tmpB = rf[SrcB]; Accum1 = crc32_8(Accum1, (rf[SrcB] & 0xff)); rf[Dest] = csum(tmpA, (tmpB & 0xff));} | Compute the checksum of two words and accumulate the CRC of one of the words.
CSUM_HASHS16_0 | csum_hashs16_0 r3,r1,r2 | 52 | {unsigned int tmpA = rf[SrcA], tmpB = rf[SrcB]; Accum0 = crc32_16(0xffffffff, rf[SrcB]); rf[Dest] = csum(tmpA, tmpB);} | Compute the checksum of two words and CRC of one of the words with a seed into the accumulator.
CSUM_HASHS16_1 | csum_hashs16_1 r3,r1,r2 | 53 | {unsigned int tmpA = rf[SrcA], tmpB = rf[SrcB]; Accum1 = crc32_16(0xffffffff, rf[SrcB]); rf[Dest] = csum(tmpA, tmpB);} | Compute the checksum of two words and CRC of one of the words with a seed into the accumulator.
CSUM_HASHS8_0 | csum_hashs8_0 r3,r1,r2 | 54 | {unsigned int tmpA = rf[SrcA], tmpB = rf[SrcB]; Accum0 = crc32_8(0xffffffff, (rf[SrcB] & 0xff)); rf[Dest] = csum(tmpA, (tmpB & 0xff));} | Compute the checksum of two words and CRC of one of the words with a seed into the accumulator.
CSUM_HASHS8_1 | csum_hashs8_1 r3,r1,r2 | 55 | {unsigned int tmpA = rf[SrcA], tmpB = rf[SrcB]; Accum1 = crc32_8(0xffffffff, (rf[SrcB] & 0xff)); rf[Dest] = csum(tmpA, (tmpB & 0xff));} | Compute the checksum of two words and CRC of one of the words with a seed into the accumulator.
SHL1ADD | shl1add r3,r1,r2 | 16 | rf[Dest] = (rf[SrcA] << 1) + rf[SrcB]; | Shifts the first operand left by one bit and then adds the second source operand.
SHL1ADDI | shl1addi r3,r1,7 | 17 | rf[Dest] = (rf[SrcA] << 1) + Imm16; | Shifts the first operand left by one bit and then adds a 16-bit immediate.
SUB | sub r3,r1,r2 | 60 | rf[Dest] = rf[SrcA] - rf[SrcB]; | Subtract one word from another.

B.1.2 Comparison Instructions

Table B-3. Comparison Instructions

Instruction | Example | Opcode | Function | Description
CMPEQ | cmpeq r3,r1,r2 | 44 | rf[Dest] = ((signed short)rf[SrcA] == (signed short)rf[SrcB]) ? 0x0001 : 0x0000; | Set the destination register to 0x0001 if the first source operand is equal to the second source operand. Otherwise, set the destination register to 0x0000.
CMPEQI | cmpeqi r3,r1,0x87 | 45 | rf[Dest] = ((signed short)rf[SrcA] == (signed short)Imm16) ? 0x0001 : 0x0000; | Set the destination register to 0x0001 if the first source operand is equal to the 16-bit immediate. Otherwise, set the destination register to 0x0000.
CMPLES | cmples r3,r1,r2 | 40 | rf[Dest] = ((signed short)rf[SrcA] <= (signed short)rf[SrcB]) ? 0x0001 : 0x0000; | Set the destination register to 0x0001 if the first source operand is less than or equal to the second source operand. Otherwise, set the destination register to 0x0000.
CMPLESI | cmplesi r3,r1,7 | 42 | rf[Dest] = ((signed short)rf[SrcA] <= (signed short)Imm16) ? 0x0001 : 0x0000; | Set the destination register to 0x0001 if the first source operand is less than or equal to the 16-bit immediate. Otherwise, set the destination register to 0x0000.
CMPLEU | cmpleu r3,r1,r2 | 41 | rf[Dest] = (rf[SrcA] <= rf[SrcB]) ? 0x0001 : 0x0000; | Set the destination register to 0x0001 if the first unsigned source operand is less than or equal to the second unsigned source operand. Otherwise, set the destination register to 0x0000.
CMPLEUI | cmpleui r3,r1,7 | 43 | rf[Dest] = (rf[SrcA] <= Imm16) ? 0x0001 : 0x0000; | Set the destination register to 0x0001 if the first unsigned source operand is less than or equal to the unsigned 16-bit immediate. Otherwise, set the destination register to 0x0000.
CMPLTS | cmplts r3,r1,r2 | 36 | rf[Dest] = ((signed short)rf[SrcA] < (signed short)rf[SrcB]) ? 0x0001 : 0x0000; | Set the destination register to 0x0001 if the first source operand is less than the second source operand. Otherwise, set the destination register to 0x0000.
CMPLTU | cmpltu r3,r1,r2 | 37 | rf[Dest] = (rf[SrcA] < rf[SrcB]) ? 0x0001 : 0x0000; | Set the destination register to 0x0001 if the first unsigned source operand is less than the second unsigned source operand. Otherwise, set the destination register to 0x0000.
CMPNE | cmpne r3,r1,r2 | 46 | rf[Dest] = ((signed short)rf[SrcA] != (signed short)rf[SrcB]) ? 0x0001 : 0x0000; | Set the destination register to 0x0001 if the first source operand is not equal to the second source operand. Otherwise, set the destination register to 0x0000.
CMPNEI | cmpnei r3,r1,0x87 | 47 | rf[Dest] = ((signed short)rf[SrcA] != (signed short)Imm16) ? 0x0001 : 0x0000; | Set the destination register to 0x0001 if the first source operand is not equal to the 16-bit immediate. Otherwise, set the destination register to 0x0000.

B.1.3 Control Instructions

Table B-4. Control Instructions

Instruction | Example | Opcode | Function | Description
BGEZ | bgez r1,0x1000 | 10 | if ((signed short)rf[SrcA] >= 0) { setNextPC(Imm16); delay += 2; taken = true; } else { setNextPC(pc + 1); } | Branch to the target address if the source operand is greater than or equal to zero.
BGEZT | bgezt r1,0x1000 | 11 | if ((signed short)rf[SrcA] >= 0) { setNextPC(Imm16); delay += 1; taken = true; } else { setNextPC(pc + 1); delay += 2; } | Branch to the target address if the source operand is greater than or equal to zero. Provide the hardware a hint that the branch will be taken.
BLBC | blbc r1,0x1000 | 14 | if (!(rf[SrcA] & 0x1)) { setNextPC(Imm16); delay += 2; taken = true; } else { setNextPC(pc + 1); } | Branch to the target address if Bit 0 of the source operand is 0.
Instruction: BLBCT   Example: blbct r1,0x1000   Opcode: 15
Function: if (!(rf[SrcA] & 0x1)) { setNextPC(Imm16); delay += 1; taken = true; } else { setNextPC(pc + 1); delay += 2; }
Description: Branch to the target address if Bit 0 of the source operand is 0. Provide the hardware a hint that the branch will be taken.

Instruction: BLBS   Example: blbs r1,0x1000   Opcode: 12
Function: if (rf[SrcA] & 0x1) { setNextPC(Imm16); delay += 2; taken = true; } else { setNextPC(pc + 1); }
Description: Branch to the target address if Bit 0 of the source operand is 1.

Instruction: BLBST   Example: blbst r1,0x1000   Opcode: 13
Function: if (rf[SrcA] & 0x1) { setNextPC(Imm16); delay += 1; taken = true; } else { setNextPC(pc + 1); delay += 2; }
Description: Branch to the target address if Bit 0 of the source operand is 1. Provide the hardware a hint that the branch will be taken.

Instruction: BLTZ   Example: bltz r1,0x1000   Opcode: 8
Function: if ((signed short)rf[SrcA] < 0) { setNextPC(Imm16); delay += 2; taken = true; } else { setNextPC(pc + 1); }
Description: Branch to the target address if the source operand is less than zero.

Instruction: BLTZT   Example: bltzt r1,0x1000   Opcode: 9
Function: if ((signed short)rf[SrcA] < 0) { setNextPC(Imm16); delay += 1; taken = true; } else { setNextPC(pc + 1); delay += 2; }
Description: Branch to the target address if the source operand is less than zero. Provide the hardware a hint that the branch will be taken.

Instruction: JR   Example: jr r1   Opcode: 5
Function: setNextPC(rf[SrcA]); delay += 2;
Description: Jump to the address in the source register.

B.1.4 Logical Instructions

Table B-5. Logical Instructions

Instruction: AND   Example: and r3,r1,r2   Opcode: 22
Function: rf[Dest] = rf[SrcA] & rf[SrcB];
Description: Compute the logical AND of two words.

Instruction: ANDI   Example: andi r3,r1,0x0f   Opcode: 23
Function: rf[Dest] = rf[SrcA] & Imm16;
Description: Compute the logical AND of a word and a 16-bit immediate.

Instruction: OR   Example: or r3,r2,r1   Opcode: 24
Function: rf[Dest] = rf[SrcA] | rf[SrcB];
Description: Compute the logical OR of two words.

Instruction: ORI   Example: ori r3,r1,0x0f   Opcode: 25
Function: rf[Dest] = rf[SrcA] | Imm16;
Description: Compute the logical OR of a word and a 16-bit immediate.
Instruction: ROTL   Example: rotl r3,r1,r2   Opcode: 34
Function: rf[Dest] = (rf[SrcA] >> (16 - (rf[SrcB] & 0xf))) | (rf[SrcA] << (rf[SrcB] & 0xf));
Description: Rotate a word left by the number of bits specified in the second source operand.

Instruction: ROTLI   Example: rotli r3,r1,7   Opcode: 35
Function: rf[Dest] = ((unsigned short)rf[SrcA] >> (16 - (Imm16 & 0xf))) | ((unsigned short)rf[SrcA] << (Imm16 & 0xf));
Description: Rotate a word left by the number of bits specified in an unsigned 16-bit immediate.

Instruction: SHL   Example: shl r3,r1,r2   Opcode: 28
Function: rf[Dest] = rf[SrcA] << (rf[SrcB] & 0xf);
Description: Left-shift a word by the number of bits specified in the second source register.

Instruction: SHLI   Example: shli r3,r1,7   Opcode: 29
Function: rf[Dest] = rf[SrcA] << (Imm16 & 0xf);
Description: Left-shift a word by the number of bits specified in an unsigned 16-bit immediate.

Instruction: SHRS   Example: shrs r3,r1,r2   Opcode: 32
Function: { unsigned int sign = rf[SrcA] & 0x8000; unsigned int tmp = rf[SrcA]; for (int i = 0; i < (rf[SrcB] & 0xf); i++) { tmp >>= 1; tmp |= sign; } rf[Dest] = tmp; }
Description: Right-shift a word by the number of bits specified in the second source register. Upper bits are filled with the value of the most-significant bit of the first source operand.

Instruction: SHRSI   Example: shrsi r3,r1,7   Opcode: 33
Function: { unsigned int sign = rf[SrcA] & 0x8000; unsigned int tmp = rf[SrcA]; for (unsigned int i = 0; i < (Imm16 & 0xf); i++) { tmp >>= 1; tmp |= sign; } rf[Dest] = tmp; }
Description: Right-shift a word by the number of bits specified in an unsigned 16-bit immediate. Upper bits are filled with the value of the most-significant bit of the first source operand.

Instruction: SHRU   Example: shru r3,r1,r2   Opcode: 30
Function: rf[Dest] = rf[SrcA] >> (rf[SrcB] & 0xf);
Description: Right-shift a word by the number of bits specified in the second source register. Upper bits are filled with zeros.

Instruction: SHRUI   Example: shrui r3,r1,7   Opcode: 31
Function: rf[Dest] = rf[SrcA] >> (Imm16 & 0xf);
Description: Right-shift a word by the number of bits specified in an unsigned 16-bit immediate. Upper bits are filled with zeros.

Instruction: XOR   Example: xor r3,r1,r2   Opcode: 26
Function: rf[Dest] = rf[SrcA] ^ rf[SrcB];
Description: Compute the logical XOR of two words.

Instruction: XORI   Example: xori r3,r1,0x0f   Opcode: 27
Function: rf[Dest] = rf[SrcA] ^ Imm16;
Description: Compute the logical XOR of a word and a 16-bit immediate.

B.1.5 Miscellaneous Instructions

Table B-6. Miscellaneous Instructions

Instruction: CTZ   Example: ctz r3,r1   Opcode: 21
Function: { unsigned int counter; for (counter = 0; counter < 16; counter++) { if ((rf[SrcA] >> counter) & 0x1) { break; } } rf[Dest] = counter; }
Description: Returns the number of trailing zeros in a word before a bit is set (1). This instruction scans the input word from the least significant bit to the most significant bit. The result of this operation can range from 0 to 16.

Instruction: MFSPR   Example: mfspr r6,0x5   Opcode: 3
Function: rf[Dest] = sprf[SprIdx];
Description: Move a word from the special purpose register indexed by an immediate.

Instruction: MTSPR   Example: mtspr 0x5,r1   Opcode: 2
Function: sprf[SprIdx] = rf[SrcA];
Description: Move a word to the special purpose register indexed by an immediate.

Instruction: REDMG4   Example: redmg4 r3,r2,r1   Opcode: 59
Function: rf[Dest] = (crc8_16(0xff, rf[SrcA]) & 0xf) | (rf[SrcB] & 0xfff0);
Description: Performs an 8-bit CRC reduction on a 16-bit input (keeping the low 4 bits) and merges the result with the upper 12 bits of the other input.

Instruction: REDMG6   Example: redmg6 r3,r2,r1   Opcode: 58
Function: rf[Dest] = (crc8_16(0xff, rf[SrcA]) & 0x3f) | (rf[SrcB] & 0xffc0);
Description: Performs an 8-bit CRC reduction on a 16-bit input (keeping the low 6 bits) and merges the result with the upper 10 bits of the other input.

Instruction: REDMG8   Example: redmg8 r3,r2,r1   Opcode: 57
Function: rf[Dest] = (crc8_16(0xff, rf[SrcA]) & 0xff) | (rf[SrcB] & 0xff00);
Description: Performs an 8-bit CRC reduction on a 16-bit input (keeping the low 8 bits) and merges the result with the upper 8 bits of the other input.
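
The Function entries for ROTL, SHRS and CTZ all assume 16-bit registers and a shift count masked to 4 bits. A minimal host-side sketch of those three entries is shown below. The names cls_rotl, cls_shrs and cls_ctz are hypothetical helpers introduced here for illustration only; they are not part of any Tilera API, and the code models the table text rather than the hardware.

```c
#include <stdint.h>

/* Hypothetical host-side models of three classifier instructions.
 * They follow the Function column of Tables B-5 and B-6 and assume
 * 16-bit registers; the names are illustrative only. */

/* ROTL: rotate a 16-bit word left by the low 4 bits of b. */
uint16_t cls_rotl(uint16_t a, uint16_t b)
{
    unsigned n = b & 0xf;
    uint32_t wide = a;  /* widen so the shift by 16 (n == 0) is well defined */
    return (uint16_t)((wide >> (16 - n)) | (wide << n));
}

/* SHRS: arithmetic right shift, replicating bit 15 into the vacated bits. */
uint16_t cls_shrs(uint16_t a, uint16_t b)
{
    uint16_t sign = a & 0x8000;
    uint16_t tmp = a;
    for (unsigned i = 0; i < (b & 0xfu); i++) {
        tmp >>= 1;
        tmp |= sign;
    }
    return tmp;
}

/* CTZ: count trailing zeros, scanning from bit 0; returns 16 for input 0. */
uint16_t cls_ctz(uint16_t a)
{
    uint16_t counter;
    for (counter = 0; counter < 16; counter++) {
        if ((a >> counter) & 0x1) {
            break;
        }
    }
    return counter;
}
```

Because the count is masked with 0xf, a rotate or shift count of 16 behaves like a count of 0 in this model.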
B.2 Registers

B.2.1 Register Summary

Table B-7. Register Summary

Address   Register
0000      CLASSIFIER HEADER FLAGS (CLASSIFIER_HEADER_FLAGS: 0X0000)
0001      CLASSIFIER CHANNEL (CLASSIFIER_CHANNEL: 0X0001)
0002      CLASSIFIER CURRENT PACKET SIZE (CLASSIFIER_L2_SIZE: 0X0002)
0003      CLASSIFIER PASS (CLASSIFIER_PASS: 0X0003)
0004      CLASSIFIER BUDGET (CLASSIFIER_BUDGET: 0X0004)
0006      CLASSIFIER CONTROL (CLASSIFIER_CTL: 0X0006)
0007      CLASSIFIER RAND (CLASSIFIER_RAND: 0X0007)

B.2.2 Register Definitions

CLASSIFIER HEADER FLAGS (CLASSIFIER_HEADER_FLAGS: 0X0000)

This register contains information about the iHdr.

Figure B-2: CLASSIFIER_HEADER_FLAGS Register (fields: PKT_ERR, CERR, TRUNC, ME, HVLD, Reserved, CT)

Table B-8. CLASSIFIER_HEADER_FLAGS Register Bit Descriptions

Bits: 15   Name: CT   Type: RO   Reset: 0
Description: If asserted, packet was cut-through. This bit will be copied into the descriptor.

Bits: 5:4   Name: HVLD   Type: RO   Reset: 0
Description: Valid Headers. Provides a count of the number of valid headers buffered for the classifier. Since the classifier stalls on MFSPR to this SPR if the header is not valid, the only values the program will ever see are 1 or 2.

Bits: 3   Name: ME   Type: RO   Reset: 0
Description: MAC Error. If asserted, packet from MAC was terminated with a MAC-error indicator. This bit will be copied into the descriptor. Note that this bit is not valid on packets that are cutting through.

Bits: 2   Name: TRUNC   Type: RO   Reset: 0
Description: If asserted, packet was truncated due to insufficient storage in packet buffer. This bit will be copied into the descriptor. Note that this bit is not valid on packets that are cutting through.

Bits: 1   Name: CERR   Type: RO   Reset: 0
Description: L2 CRC Error. If asserted, packet had an L2 CRC error. This bit will be copied into the descriptor. Note that this bit is not valid on packets that are cutting through.

Bits: 0   Name: PKT_ERR   Type: RO   Reset: 0
Description: Packet Error. This field indicates that the packet had a packet error. It is equivalent to (CERR || TRUNC || ME).

CLASSIFIER CHANNEL (CLASSIFIER_CHANNEL: 0X0001)

This register indicates the source channel for the original packet.

Figure B-3: CLASSIFIER_CHANNEL Register (fields: CHANNEL, Reserved)

Table B-9. CLASSIFIER_CHANNEL Register Bit Descriptions

Bits: 4:0   Name: CHANNEL   Type: RO   Reset: 0
Description: Channel for the original packet.

CLASSIFIER CURRENT PACKET SIZE (CLASSIFIER_L2_SIZE: 0X0002)

This register indicates the number of bytes in the current packet not including SFD, the preamble, or CRC. If the CT bit of the CLASSIFIER_HEADER_FLAGS register is asserted, this register indicates the number of bytes received before classification.

Figure B-4: CLASSIFIER_L2_SIZE Register (fields: L2_SIZE, Reserved)

Table B-10. CLASSIFIER_L2_SIZE Register Bit Descriptions

Bits: 13:0   Name: L2_SIZE   Type: RO   Reset: 0
Description: Number of bytes in L2, or number of bytes received before classification.

CLASSIFIER PASS (CLASSIFIER_PASS: 0X0003)

This register is used in an implementation-specific way for communication between the simulator and tile software. It shares the same 16-bit storage area as CLASSIFIER_FAIL and CLASSIFIER_DONE. This data is also visible to Tile software via I/O configuration and can be used to exchange data between the Classifier and Tile software.
Figure B-5: CLASSIFIER_PASS Register (fields: DATA)

Table B-11. CLASSIFIER_PASS Register Bit Descriptions

Bits: 15:0   Name: DATA   Type: RW   Reset: 0
Description: Implementation-specific data.

CLASSIFIER BUDGET (CLASSIFIER_BUDGET: 0X0004)

This register provides the current cycle budget information.

Figure B-6: CLASSIFIER_BUDGET Register (fields: CNT, Reserved, BELOW_HWM)

Table B-12. CLASSIFIER_BUDGET Register Bit Descriptions

Bits: 15   Name: BELOW_HWM   Type: RO   Reset: 0
Description: When asserted, the classifier header queue is below the high water mark and the cycle budget will be ignored. When clear, the cycle budget counter reaching zero will cause the current packet processing to be terminated.

Bits: 10:0   Name: CNT   Type: RO   Reset: 0
Description: Current budget cycle count. When this count reaches zero and the BELOW_HWM bit is clear, the processing of the current packet will be terminated and the default DEST, NOTIFRING, and STACK fields from the CLS_CTL register will be applied.

CLASSIFIER CONTROL (CLASSIFIER_CTL: 0X0006)

This register contains control bits for the classifier.

Figure B-7: CLASSIFIER_CTL Register (fields: FRZ, TINT, Reserved)

Table B-13. CLASSIFIER_CTL Register Bit Descriptions

Bits: 1   Name: TINT   Type: WO   Reset: 0
Description: When written with a 1, send an interrupt to the tile. Always reads zero.

Bits: 0   Name: FRZ   Type: WO   Reset: 0
Description: When written with a 1, freeze the classifier. Always reads zero.

CLASSIFIER RAND (CLASSIFIER_RAND: 0X0007)

This register contains a pseudo-random value.

Figure B-8: CLASSIFIER_RAND Register (fields: VAL)

Table B-14. CLASSIFIER_RAND Register Bit Descriptions

Bits: 15:0   Name: VAL   Type: RW   Reset: 0
Description: Value. This field provides a pseudo-random number based on an LFSR. It advances on each read when the RAND_MODE bit of the CLS_CTL configuration register is zero, and is free running when the RAND_MODE bit of the CLS_CTL register is one. If written with 0, VAL will never advance.

APPENDIX C: MISCELLANEOUS ACCELERATOR SPECIFICATIONS

This appendix provides additional information about the four types of accelerators included with the TILE-Gx family of processors. These are:
• C.1 SNOW-3G Engines
• C.2 KASUMI Engines
• C.3 Packet Processor — Programming
• C.5 Public Key Accelerator (PKA)

C.1 SNOW-3G Engines

C.1.1 Specification Summary

The SNOW-3G Engine supports the following features:
• Fully supports SNOW 3G
• Supported key size: 128-bit
• Key scheduling hardware
• Supported modes: UEA2, UIA2, 128-EEA1 and 128-EIA1
  Note: These four modes are all the available encryption and integrity algorithms defined for SNOW 3G within 3GPP.
• Fully synchronous design

Note: The SNOW 2.0 algorithm as defined in [SNOW2] is similar, but not identical, to the 3GPP SNOW algorithm as defined in [SNOW-3G]. This means that it is not possible to perform SNOW 2.0 [SNOW2] operations with this SNOW 3G core.
Figure C-1: SNOW-3G Block Diagram (blocks: Key Stream Generation, Engine Control, Feedback Modes; signals: iv [127:0], key [127:0], mode [1:0], length [15:0], data [31:0] input and output)

C.1.2 Performance

C.1.2.1 Introduction

The following sections specify the performance of the SNOW-3G. For all numbers in this chapter, it is assumed that the engine is kept fully utilized; that is, the host supplies input blocks and retrieves output blocks in such a way that the engine never needs to wait for input, and the previous result has been retrieved before the next output becomes ready.

In the first table of each section, the "cycles per block", "bits/cycle" and "throughput at maximum frequency" numbers do not apply to the first block after selection of a new key and/or mode of operation. The second table in each section gives the extra clock cycles required for changing a context (key and/or mode). For each new key and/or IV, the LFSR of the key generation module requires 32 cycles to start up.

Note that for the authentication modes (UIA2 and 128-EIA1), five basic 32-bit SNOW operations are required per authentication operation, of which two need to be executed beforehand. The other three are performed during the authentication of the message. The authentication operation itself is based on a 64-bit polynomial multiplication.

Performance of the SNOW Engine

Table C-83 lists the performance of the SNOW Engine for all supported modes of operation.

Table C-83. High-Speed Performance

Key Size: 128
Direction / Mode: Encryption / Decryption, UEA2 / 128-EEA1; Encryption / Decryption, UIA2 / 128-EIA1
Input/Output Block Size: 32
Cycles per Block: 1
Throughput, Bits/Cycle: 32
Throughput at SNOW Typical Clock Frequency of 600 MHz (Mbits/sec): 19200

SNOW requires initialization of the LFSR and state. Therefore, 32 rounds are required for initialization after a key, IV and/or mode switch. The number of initialization cycles per mode is shown in Table C-84.

Table C-84. High-Speed Context Switch Overhead

Overhead per Mode per New Context (Direction Independent)   Extra Cycles Needed per New Context (Key / IV)
mode is basic                                                32 + 3 = 35
mode is UEA2 / 128-EEA1                                      32 + 3 = 35
mode is UIA2 / 128-EIA1                                      32 + 1 + 2*1 + 2 + 2 + 3 = 41

C.1.3 Functional Description

C.1.3.1 SNOW Key Stream Generator

Introduction

The SNOW Key stream generator implements the SNOW algorithm as specified in "[SNOW-3G]" on page 578. The core operates on the input IV and key and performs the required Feedback Shift, substitution, multiply and XOR operations. Each round results in a 32-bit key that is used to encrypt the data. Inherently, considerable parallelism is possible with the SNOW algorithm. This is described in the sections that follow.

Sub-Modules

FSM

The main SNOW FSM in the key generation module consists of three state registers, two sets of S-boxes to apply transforms S1 and S2 [SNOW-3G], two 32-bit adders and two 32-bit XOR operations. At the start of the key initialization, the state registers are in reset and the state builds up during the initialization phase of the LFSR.

LFSR

The largest component of the key generation module is the LFSR. It contains sixteen 32-bit shift registers, connected sequentially, and a feedback loop consisting of three XOR operations, alpha multiplication and alpha division.
The latter two components are defined as two recursive functions, explained in detail in the next paragraph.

α and α^-1

Two large components of the feedback loop of the LFSR are the alpha multiplication (MULα), which is a recursive multiplication in the GF(2^8) domain, and an inverse operation in the GF(2^8) domain defined as alpha division (DIVα). These two functions have an 8-bit input and return a 32-bit result. Since these functions are recursive up to 256 levels deep, implementing them directly would be inefficient. To map these functions efficiently to hardware, α and α^-1 are implemented as 8x32-bit look-up tables. For more information about the SNOW algorithm and the mathematical background of the described functions, refer to the SNOW specifications [SNOW-3G].

C.1.4 Feedback Logic and XOR

The feedback logic module implements the confidentiality and integrity algorithms for SNOW as they are defined by ETSI/SAGE for use within 3GPP. Refer to [UEA2-UIA2].

In the confidentiality modes of operation (UEA2/128-EEA1), the plaintext/ciphertext is simply XORed with the generated key data. The result of the XOR operation is the corresponding ciphertext/plaintext.

The integrity algorithm requires additional functional and control logic. Before the authentication of the message can start, the key stream generator must be called twice to produce a 64-bit key for the basic integrity EVAL_M function (refer to [UEA2-UIA2]), which mainly consists of a 64-bit polynomial multiplier. The implementation of this multiplier is configuration dependent. In the High Speed configurations, a 1-cycle version of this multiplier is implemented. In the Medium Speed configuration, a 5-cycle version is used. The multiplier uses a fixed polynomial as defined by the SNOW algorithm (x^64 + x^4 + x^3 + x + 1).
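
As an illustration of what a 64-bit polynomial multiplication modulo x^64 + x^4 + x^3 + x + 1 computes, the following bit-serial sketch may help. It is a software model only, not a description of the 1-cycle or 5-cycle hardware multiplier. The function names gf64_mulx and gf64_mul are hypothetical, and the mapping of bit i to the coefficient of x^i is an assumption made for this example.

```c
#include <stdint.h>

/* Low-order terms of the reduction polynomial x^64 + x^4 + x^3 + x + 1.
 * The x^64 term is implicit in the 64-bit overflow. */
#define GF64_POLY_LOW 0x1bULL

/* Multiply v by x: shift left one bit and, if the x^63 coefficient was
 * set, fold the overflow back in using the reduction polynomial. */
static uint64_t gf64_mulx(uint64_t v)
{
    uint64_t carry = v >> 63;
    return (v << 1) ^ (carry ? GF64_POLY_LOW : 0);
}

/* Hypothetical bit-serial model: multiply v by p in GF(2)[x] modulo the
 * reduction polynomial, with bit i assumed to hold the x^i coefficient. */
uint64_t gf64_mul(uint64_t v, uint64_t p)
{
    uint64_t result = 0;
    for (int i = 0; i < 64; i++) {
        if (p & 1) {
            result ^= v;   /* add v * x^i when bit i of p is set */
        }
        v = gf64_mulx(v);  /* advance v to v * x^(i+1) */
        p >>= 1;
    }
    return result;
}
```

For example, under the stated bit convention, multiplying the element x^63 by x overflows once and reduces to x^4 + x^3 + x + 1.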
To calculate the final 32-bit authentication result, the SNOW key generation must generate three additional parameters used to close the 64-bit multiplication sequence and XOR the final 32-bit result.

C.1.5 Examples

Listing C-1. Example 1: SNOW implementers test data, Test Set 1, [UEA2-UIA2-Test]:

Key:        2BD6459F 82C5B300 952C4910 4881FF48
IV In:      EA024714 AD5C4D84 DF1F9B25 1C0BF45F
Data In:    00000000 00000000
Key Stream: ABEE9704 7AC31373
Data Out:   ABEE9704 7AC31373

SNOW Wide-Bus Engine:
key_in[127:0]:  4881FF48 952C4910 82C5B300 2BD6459F
iv_in[127:0]:   1C0BF45F DF1F9B25 AD5C4D84 EA024714
mode_in[1:0]:   1
data_in[31:0]:  00000000 00000000
data_out[31:0]: ABEE9704 7AC31373

Listing C-2. Example 2: SNOW implementers test data, Test Set 4, [UEA2-UIA2-Test] (non-zero input data added)

Key:        0DED7263 109CF92E 3352255A 140E0F76
IV In:      6B68079A 41A7C4C9 1BEFD79F 7FDCC233
Data In:    12345678 9ABCDEF0 90021010 …… …… EBABEFAC
Key Stream: D712C05C A937C2A6 EB7EAAE3 …… …… 9C0DB3AA
Data Out:   C5269624 338B1C56 7B7CBAF3 …… …… E26B3406

SNOW Wide-Bus Engine:
key_in[127:0]:  140E0F76 3352255A 109CF92E 0DED7263
iv_in[127:0]:   7FDCC233 1BEFD79F 41A7C4C9 6B68079A
mode_in[1:0]:   0
data_in[31:0]:  12345678 9ABCDEF0 90021010 …… …… EBABEFAC
data_out[31:0]: C5269624 338B1C56 7B7CBAF3 …… …… E26B3406

C.1.6 Operations

C.1.6.1 General Operations

This section describes the control and programming sequences of the SNOW-3G from a user perspective. It focuses on the typical/practical cases for regular, medium and maximum sized key data blocks.

For a regular operation, mode, key, IV and data must be provided to the SNOW-3G.
The mode, key and IV must be provided together with or before the first key data block. For authentication modes (UIA2 / 128-EIA1), the length must also be provided together with the mode, key and IV, or before the first key data block.

C.1.6.2 Encryption Modes: UEA2 / 128-EEA1

For a default encryption/decryption operation, the following mode must be programmed in the SNOW Mode register: 3'b011 / 3'b010.
• bit [0]: selects encryption/decryption (for this mode, this bit has no effect and can be set to any value)
• bit [2:1]: selects UEA2/128-EEA1

In addition, the host must provide the following parameters:
• key, used to seed the LFSR
• iv, used to seed the LFSR
• data, provided in 32-bit blocks via the data input bus

The number of 32-bit ciphertext blocks must match the number of provided 32-bit input data blocks.

Note: SNOW is a stream cipher and for this reason does not require padding. However, the SNOW-3G is 32-bit block oriented, which means that data must be submitted in multiples of 32 bits. Therefore, if the message length is not a multiple of 32 bits, the last input block must be filled with some value, for example zeroes, to complete the 32-bit block. The remaining (invalid) bits of the last 32-bit result data block must be removed by the host or external system after the encryption/decryption operation.

Authentication Modes: UIA2 / 128-EIA1

For an authentication operation, the following mode must be programmed in the SNOW Mode register: 3'b100 or 3'b101.
• bit [0]: sets DIRECTION bit
• bit [2:1]: selects UIA2 / 128-EIA1

Besides the direction and mode of operation, the host must provide the following parameters:
• key, used to seed the LFSR
• iv, used to seed the LFSR
• length (in bits), used to finalize the authentication operation
• data, provided in 32-bit blocks via the data input bus (the number of 32-bit blocks provided to the SNOW-3G must match the length, rounded up to a 32-bit multiple, divided by 32)

The only result data is the 32-bit MAC-I.
This MAC is only provided on the data output bus once all input data has been provided to and processed by the SNOW core.

Note: Since the SNOW-3G is 32-bit oriented and the amount of authentication data can be bit aligned, the SNOW-3G pads the remaining bits of the last 32-bit word with zeroes if needed. The host always provides 32-bit blocks of data, so when the length is not 32-bit aligned, the unused bits of the last 32-bit word are forced to zero by the SNOW-3G.

Refer to "Glossary, Conventions and Standards" on page 585 for information on conventions and standards.

C.2 KASUMI Engines

C.2.1 Introduction

C.2.1.1 Specification Summary

The KASUMI engine supports the following features:
• Key scheduling hardware
• f8 and f9 algorithm support
• Automatic data padding mechanism for f9 algorithm
• KASUMI encryption and decryption modes

Figure C-2: KASUMI Engine Diagram (blocks: f8 / f9 Wrapper, KASUMI Key Scheduling, KASUMI Calculation; signals: mode [3:0], config [31:0], fresh [31:0], key [127:0], data [63:0] input and output)

C.2.2 KASUMI Engine Functional Description

C.2.2.1 General Processing

The KASUMI engine is an efficient implementation of the KASUMI cipher algorithm, the f8 confidentiality algorithm and the f9 integrity algorithm. The KASUMI engine supports three operational modes:
• KASUMI mode (encrypt and decrypt)
• f8 mode
• f9 mode

In general, the plain message data is parsed and sequentially fed into the engine in 64-bit blocks. The processing of one 64-bit data block takes 8 rounds. A round takes one clock cycle. It is not possible to interleave two packets using f8 and/or f9 mode.
This means a full f8 or f9 operation must be finished before the next one can be started.

KASUMI Mode

During KASUMI mode, the 64-bit plaintext data blocks are encrypted/decrypted under a 128-bit key, which is programmed into the KASUMI engine by the host. During each round, the Round Keys are derived from this initial key. After eight processing rounds, the KASUMI encrypted cipher text data block is stored in the output data register.

The KASUMI engine is able to operate in the following sub-modes:
• Encrypt mode
• Decrypt mode

When operating in encrypt mode, plaintext data is transformed into KASUMI cipher text data. In decrypt mode, the KASUMI cipher text data is transformed back into the original plaintext data. More information regarding the KASUMI algorithm can be found in [3GPP TS 35.202].

f8_mode

During f8 mode, the 64-bit message data blocks are transformed into 64-bit output data blocks under the control of a 128-bit Confidentiality Key. The total message length can be up to 2^16 = 65536 bits.

Note: KASUMI f8 is a stream cipher and for this reason does not require padding. However, the KASUMI engine is block oriented, which means that data must be submitted in multiples of 64 bits. Therefore, if the message length is not a multiple of 64 bits, the last input block must be filled with some value, for example zeroes, to complete the 64-bit block. The remaining (invalid) bits of the last 64-bit result data block must be removed by the host or external system after the encryption/decryption operation.

More information regarding the f8 algorithm can be found in [3GPP TS 35.201].

f9_mode

During f9 mode, the message data is processed in 64-bit chunks under the control of a 128-bit Integrity Key. The result, after all message data has been processed, is a 32-bit MAC value. The total length of the message data can be up to 2^16 = 65536 bits.

C.2.2.2 Examples

Listing C-3. Example 1: KASUMI Encrypt Session [3GPP TS 35.203], chapter 3.3 (Test Set 1):

Key:         2BD6459F 82C5B300 952C4910 4881FF48
Data input:  EA024714 AD5C4D84
Data output: DF1F9B25 1C0BF45F

KASUMI engine Interface:
key_in[127:0]:  2BD6459F 82C5B300 952C4910 4881FF48
data_in[63:0]:  EA024714 AD5C4D84
data_out[63:0]: DF1F9B25 1C0BF45F

Listing C-4. Example 2: f8 Session [3GPP TS 35.203], chapter 4.3 (Test Set 1):

Key:         2BD6459F 82C5B300 952C4910 4881FF48
Count:       72A4F20F
Bearer:      0C
Direction:   1
Length:      798 bits
Data input:  7EC61272 743BF161 .. 9B134880
Data output: D1E2DE70 EEF86C69 .. 9339650F

KASUMI engine Interface:
key_in[127:0]:   2BD6459F 82C5B300 952C4910 4881FF48
count_in[31:0]:  72A4F20F
config_in[31:0]: 00000019
data_in[63:0]:   7EC61272 743BF161
                 :
                 9B134880 [00000000]  <- pad to the right to a full block
data_out[63:0]:  D1E2DE70 EEF86C69
                 :
                 9339650F [25915EE3]  <- ignore encrypted padding

Listing C-5. Example 3: f9 Session [3GPP TS 35.203], chapter 5.3 (Test Set 1):

Key:        2BD6459F 82C5B300 952C4910 4881FF48
Count:      38A6F056
Fresh:      05D2EC49
Direction:  0
Length:     189 bits
Data input: 6B227737 296F393C 8079353E DC87E2E8 05D2EC49 A4F2D8E0
MAC output: F63BD72C

KASUMI engine Interface:
key_in[127:0]:   2BD6459F 82C5B300 952C4910 4881FF48
count_in[31:0]:  38A6F056
fresh_in[31:0]:  05D2EC49
config_in[31:0]: 00BD0000
data_in[63:0]:   6B227737 296F393C
                 8079353E DC87E2E8
                 05D2EC49 A4F2D8E0
data_out[63:32]: F63BD72C

C.3 Packet Processor — Programming

C.3.1 Introduction

C.3.1.1 Purpose

This section describes how to program the various protocols and modes within Tilera's packet processor.
This process involves:
• Identifying key protocol concerns and programming procedures
• Describing protocol processing flows and data flows within the packet processor, and providing programming examples

C.3.1.2 Scope

This section specifically covers the use of the packet processor protocols, modes, and onboard token and context interfaces for the supported configurations.

C.3.1.3 Abbreviation and Definitions

For a complete list of abbreviations and definitions, refer to “Glossary, Conventions and Standards” on page 585.

Note: Since the TLS and DTLS protocols originate from the same SSL protocol, the abbreviation ICV (Integrity Check Value) is interchangeable with the abbreviation MAC (Message Authentication Code).

C.3.1.4 Data Flow Table

This appendix often refers to the data flow table, which represents data movement during packet processing. The generic view of the table is shown in Table C-85.

Table C-85. Data Flow Table

Column | Content
N | <instruction number>
Instruction | <instruction>
Source of Data | <source>
Destination: Remove | <data to remove>
Destination: Hash | <data to hash>
Destination: Cipher | <data to encrypt / decrypt>
Destination: Output | <data for the output>
Destination: Context | <data for context record>

Each line of the table represents a separate token instruction for data movement and processing, except the execution of the REMOVE RESULT instruction, which indicates the action of a previously scheduled instruction. The Instruction field gives the name of the processing instruction. The Source of Data field gives the source of the data block. The Destination field gives the destination of the data during processing. The same data can go to several different destinations at the same time.

The data flow tables use the following colors to highlight specific data types, as shown in Table C-86.

Table C-86.
Data Flow Tables Color Fields

Color | Data type
Cyan | Bypassed data
Blue | Packet header data
Yellow | IV and other fields
Grey | Packet payload, padding etc.
Light-green | ICV, MAC or other message integrity or authentication field
Ochre | Additional fields used internally for processing

C.3.2 ARC4 Algorithm

The packet processor supports the ARC4 algorithm, according to the ARC4 specification, as a basic operation. The ARC4 state table is located in the Extra Data immediately following the Context Record. Pointers to the state record and the I-J Pointer are located in the context.

The ARC4 mode is controlled via the following bits:
• State selection. Selects stateless (0) or stateful (1) mode.
• I-J Pointer. When set, indicates that the I-J Pointer and the ARC4 state pointer are present in the context record.
• Crypto store. Must be set to 1 for later reuse of the I-J Pointer.
• Initialize ARC4. This bit is located in the packet-based options field and is applicable only when the context control words are loaded from the token. When set, it enables initialization of the ARC4 state memory with the default state and overrules the stateful mode setting of the state selection field.

The fetching and update of the ARC4 state is shown in Figure C-3.

[Figure C-3: Fetching and Update of Context Record and ARC4 State. Blocks shown: Context Memory, ARC4 Memory, Context DMA, MUX, ARC4 Core. Flows shown: context fetch (including I-J pointer and ARC4 state pointer), ARC4 state fetch (256 bytes), CONTEXT_ACCESS instruction (I-J pointer update), CONTEXT_ACCESS instruction (ARC4 state update).]

When a packet with ARC4 in stateful mode is started, the context record should contain the I-J Pointer and ARC4 state pointer fields. The ARC4 state is automatically fetched into the dedicated ARC4 RAM immediately after the context fetch.
Note that the context and the ARC4 state are both in the Extra Data. The ARC4 state and the I-J Pointer in the external context memory are updated by two separate CONTEXT_ACCESS instructions, one for each.

C.3.3 AES-CCM for Basic Operations and IPSec Protocols

C.3.3.1 Introduction

The packet processor supports AES-CCM as a basic operation, as specified in RFC3610, and as part of the IPSec ESP protocol. Refer to “Packet Processor References” on page 577 for more information.

AES-CCM combines two cryptographic mechanisms that are based on AES encryption. The processing diagram is shown in Figure C-4.

[Figure C-4: AES-CCM Processing Diagram. Upper part (AES-CBC encrypt, input for authentication): 16-byte blocks B0 (IV), B1 ... Bk (AAD, zero padded), Bk+1 ... Bm (message, zero padded), producing the CBC-MAC result. Lower part (AES-CTR encrypt, input for authentication and encryption): counter blocks A1, A2 ... An produce S1, S2 ... Sn and the result ciphertext C1, C2 ... Cn; A0 produces S0, which is combined with the truncated CBC-MAC result to give the 8/12/16-byte Tag (truncated MAC).]

One mechanism within CCM is the Counter (CTR) mode for confidentiality, which is specified in [RFC3686]. The CTR mode requires the generation of a sequence of blocks, called counter blocks, that are unique for a given key. The other cryptographic mechanism within CCM is an adaptation of the Cipher Block Chaining (CBC) technique from [NIST SP800-38a] to provide assurance of authenticity. In the CBC technique, an initialization vector is applied to the data to be authenticated. The final block of the resulting CBC output serves as a Message Authentication Code (MAC) of the data. The algorithm for generating a MAC is commonly called CBC-MAC. The same cryptographic key is used for both the CTR and CBC-MAC mechanisms within CCM.

The CCM specification [RFC 3610] defines two parameters:
1.
M: The size of the authentication field, which can be any of the following: 4, 6, 8, 10, 12, 14, or 16 bytes. Although only 8 and 16 bytes are mandatory for IPSec ESP, the packet processor also supports an authentication field size of 12 bytes for IPSec ESP, since this is the default in the ESP specification [RFC 4303]. For basic operations, all ICV sizes are supported.
2. L: The size of the length field, which can have any value between 2 bytes and 8 bytes. For AES-CCM with IPSec ESP, this length must be 4 bytes. For basic operations, lengths of 2 bytes and 4 bytes are supported. A length of more than 4 bytes can be supported, but this is generally not required.

C.3.3.2 Authentication

For authentication, the data are divided into a sequence of 128-bit blocks B0, B1 ... Bm (see also Figure C-4). The AES-CBC-MAC function is then applied to these blocks. The first block, B0 (the IV for the AES-CBC function), is formatted as shown in Figure C-5.

[Figure C-5: Block B0 for Authentication. Two layouts are shown, for Length (m) = 4 bytes (L length encoded = 011) and Length (m) = 2 bytes (L length encoded = 001). In both, byte 0 holds the Flags (bit 7 = 0, bit 6 = Adata, bits 5:3 = M length (encoded), bits 2:0 = L length (encoded)), followed by Nonce N (Salt), Nonce N (IV), and Length (m).]

The values for the flags register are shown in Table C-5.

Table C-5. Flags Register for B0 Vector

Bit | Field Name | Description
[7] | Reserved | This bit must be set to 0.
[6] | Adata | Additional Authentication Data. 0: Length of AAD is zero. 1: Length of AAD is not zero.

Table C-5.
Flags Register for B0 Vector

Bit | Field Name | Description
[5:3] | M Length | MAC Length. This field indicates the encoded size (length) of the authentication field (MAC). 000: Illegal. 001: Length M is 4 bytes. 010: Length M is 6 bytes. 011: Length M is 8 bytes. 100: Length M is 10 bytes. 101: Length M is 12 bytes. 110: Length M is 14 bytes. 111: Length M is 16 bytes.
[2:0] | L Length | Encoded size of the length of the length field. 000: Illegal. 001: Length L is 2 bytes. 010: Length L is 3 bytes. 011: Length L is 4 bytes. 100: Length L is 5 bytes. 101: Length L is 6 bytes. 110: Length L is 7 bytes. 111: Length L is 8 bytes.

If AAD data are present, as indicated by the Adata bit, the AAD length field plus the additional authentication data are added. The last block is padded with zeros to a full 16-byte block. The AES-CBC-MAC is computed over all the data blocks, B0…Bm, and the final result is a tag T.

C.3.3.3 Encryption

Encryption uses the AES-CTR function to transform the plain text into cipher text and vice versa. The cipher input is a sequence of 128-bit counter blocks A1, A2 … An, and then A0 (see also Figure C-4). The format of A0 is shown in Figure C-6. The Flags field provides information about the length of the message data. Figure C-6 shows the layout of the A0 vector for two cases (counter length of 4 bytes and counter length of 2 bytes).

[Figure C-6: Block A0 for Encryption. Two layouts are shown, for Counter Length = 4 bytes (counter length encoded = 011) and Counter Length = 2 bytes (counter length encoded = 001). In both, byte 0 holds the Flags (bits 7:3 = 0, bits 2:0 = encoded counter length), followed by Nonce N (Salt), Nonce N (IV), and Counter (m).]

The values for the flags register are shown in Table C-6.

Table C-6.
Flags Register for A0 Vector [a]

Bit | Field Name | Description
[7:3] | Reserved | Must be set to 0.
[2:0] | L Length | Encoded size of the length of the block counter. 000: Illegal. 001: Length L is 2 bytes. 010: Length L is 3 bytes. 011: Length L is 4 bytes. 100: Length L is 5 bytes. 101: Length L is 6 bytes. 110: Length L is 7 bytes. 111: Length L is 8 bytes.

a. The initial counter value of the A0 vector must be initialized to zero, in contrast to AES-GCM/GMAC, where the initial counter is initialized to 1.

C.3.3.4 Implementation

The AES-CCM basic operation within the packet processor requires that the A0 and B0 fields be precalculated by the host and provided to the packet processor. B0, used for hashing, is provided via the token. A0, the initial IV for counter mode, is provided via the IV field in the context record. The encrypted A0, which is XOR-ed with the TAG, is created by encrypting a block of zeros with IV=A0. Later, the encrypted result Y0 (shown as S0 in the data flow tables) is removed from the output stream.

The authentication keys for the XCBC engine are the following:
• Key1 and Key2 are zero.
• Key0 is the same as the cipher key, but with the bytes of each word swapped.

Since A0 and B0 are pre-calculated by the host, any allowable values for M, L and Counter Length (maximum 4) are supported. In the case of an ESP packet, the salt/nonce values come from the Security Association. The implementation summary is presented in Table C-7.

Table C-7. AES-CCM Supported Functionality

Functionality | Inbound | Outbound
Key Length | 128-, 192-, 256-bit | 128-, 192-, 256-bit
M, L | Any | Any
Counter Length | 1-4 | 1-4

Table C-7.
AES-CCM Supported Functionality (continued)

Functionality | Inbound | Outbound
A0 Vector: Flag | From context | From token
A0 Vector: Salt [a] | From context | From context
A0 Vector: IV [a] | From token (for ESP, IV is taken from input) | From token (for ESP, IV is taken from context [b])
A0 Vector: Counter | From context IV (zero value) | From context IV (zero value)
B0 Vector: Flag | From token | From token
B0 Vector: Salt [a] | From token | From token
B0 Vector: IV [a] | From token (for ESP, IV is taken from input) | From token (for ESP, IV is taken from context [b])
B0 Vector: Message length | From token | From token
B0 Vector: AAD Length Field | From token | From token

a. The Salt and IV must be the same for the A0 and B0 vectors.
b. The IV in the context can also be generated by the PRNG.

For information about byte order, refer to the “Examples” on page 259.

C.3.3.5 Basic Operation

Introduction

This section explains how to perform basic inbound and outbound transforms using the AES-CCM basic operation with different parameters.

Context Control Words

AES-CCM processing requires that the correct engine mode be set in the context control words. The layout and settings of the control words are shown in the tables that follow.

[Figure: Context - Control Word 0 bit layout, bits 31 to 00. Fields: ToP, packet-based options, key, crypto algorithm, digest type, hash algorithm, SPI, SEQ, MASK0, MASK1, context length, and reserved bits.]

The applicable fields are listed in Table C-8.

Table C-8. Basic AES-CCM Context Control Word 0

Field | Value | Description
hash algorithm [a] | 001 | XCBC with 128-bit Key 1.
hash algorithm [a] | 010 | XCBC with 192-bit Key 1.
hash algorithm [a] | 011 | XCBC with 256-bit Key 1.
digest type | 10 | Use dedicated hash algorithm (XCBC).
crypto algorithm | 101 | AES-128.
crypto algorithm | 110 | AES-192.
crypto algorithm | 111 | AES-256.
key | 1 | The Key is used in processing.
context length | * | See description in “context length” on page 437.
packet based options | 0000 | Default value.
ToP (Type of Packet) | 1110 | For outbound (hash-then-encrypt operation).
ToP (Type of Packet) | 0111 | For inbound (decrypt-then-hash operation).

a. The length of Key 1 for XCBC must be equal to the AES key length. Choosing different key lengths for the XCBC and AES algorithms will produce incorrect results.

[Figure: Context - Control Word 1 bit layout, bits 31 to 00. Fields: address mode, crypto-store, IV format, digest cnt, IV3, IV2, IV1, IV0, mask upd., Feedback disable, seq. nbr. store, crypto mode, pad type, enc. hash result, hash store, i-j-pntr, state selection, and reserved bits.]

The applicable fields are listed in Table C-9.

Table C-9. Basic AES-CCM Context Control Word 1

Field | Value | Description
hash store | 0 | Do not save result digest into internal context register.
enc hash result | 1 | Use encrypted hash result (XOR operation).
pad type | 000 | No padding.
pad type | 100 | IPSec padding.
crypto store | 0 | Do not store result IV.
IV format | 00 | Full IV mode.
IV3...IV0 | 1001 | IV0 + IV3 (for ESP inbound).
IV3...IV0 | 1111 | 16-byte IV (AES) (for all other cases).
crypto mode | 110 | AES-CTR with load/reuse of the counter.

Outbound Data Flow

The basic outbound operation for AES-CCM is executed according to Table C-10.

Table C-10.
Basic Outbound AES-CCM

Instruction | Source of Data | Remove | Hash | Cipher | Output | Context
DIRECTION (Bypass data) | input | - | - | - | Bypass data | -
INSERT [a] (B0+AAD length field) | token | - | B0+AAD length field | - | - | -
DIRECTION [a] (AAD data) | input | - | AAD data | - | AAD data | -
INSERT [a] (zero padding for AAD data) | instruction | - | Zeroes | - | - | -
REMOVE_RESULT (schedule removal of S0 from the output) | output buffer | - | - | - | - | -
INSERT (zeroes for S0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | -
DIRECTION (Payload) | input | - | Payload | Payload | Payload | -
INSERT (zero padding for cipher data) | instruction | - | Zeroes | - | - | -
Execution of REMOVE_RESULT instruction | output buffer | S0 | - | - | - | -
INSERT (result TAG field) | context (hash result) | - | - | - | TAG | -

a. When there is no AAD data, the AAD length and the supplementary zero padding for AAD data are not hashed.

The outbound processing functions as follows:
1. The bypass data are passed directly to the output.
2. The B0, AAD length, AAD data and zero padding are inserted into the hash stream.
3. An instruction is scheduled to remove S0 from the output buffer.
4. A block of zeros is inserted into the cipher stream to create the encrypted A0 (S0).
5. The payload data are hashed, encrypted and then passed to the output.
6. Additional zero padding is inserted into the hash stream.
7. The S0 block is removed from the output buffer.
8. The result TAG is appended to the output stream.

Inbound Data Flow

The basic inbound operation for AES-CCM is executed according to Table C-11.

Table C-11.
Basic Inbound AES-CCM

Instruction | Source of Data | Remove | Hash | Cipher | Output | Context
DIRECTION (bypass data) | input | - | - | - | Bypass | -
INSERT (B0+AAD length field) | token | - | B0+AAD length | - | - | -
DIRECTION (AAD data) | input | - | AAD data | - | AAD data | -
INSERT (zero padding for AAD data) | instruction | - | Zeroes | - | - | -
REMOVE_RESULT (schedule removal of S0 from the output) | output buffer | - | - | - | - | -
INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | -
DIRECTION (encrypted Payload) | input | - | Payload | Payload | Payload | -
INSERT (zero padding for cipher data) | instruction | - | Zeroes | - | - | -
Execution of REMOVE_RESULT instruction | output buffer | S0 | - | - | - | -
RETRIEVE (TAG field) | input | - | - | - | - | TAG
VERIFY_FIELDS (calculated TAG with the TAG from the input) | context | - | - | - | - | -

The inbound processing functions as follows:
1. The bypass data are passed directly to the output.
2. The B0, AAD length, AAD data and zero padding are inserted into the hash stream.
3. An instruction is scheduled to remove S0 from the output buffer.
4. A block of zeros is inserted into the cipher stream to create the encrypted A0 (S0).
5. The payload data are hashed, decrypted, and then passed to the output.
6. Additional zero padding is inserted into the hash stream.
7. The S0 block is removed from the output buffer.
8. The TAG is retrieved from the input stream and stored in the context.
9. The calculated TAG is compared with the TAG retrieved from the input stream.

Packet Processor Examples

This section explains how the test vectors can be applied to the packet processor core. As an example, test vector #1 from the RFC 3610 reference (see Chapter F: RFC 3610) is shown below.
    =============== Packet Vector #1 ==================
    AES Key =  C0 C1 C2 C3  C4 C5 C6 C7  C8 C9 CA CB  CC CD CE CF
    Nonce =    00 00 00 03  02 01 00 A0  A1 A2 A3 A4  A5
    Total packet length = 31. [Input with 8 cleartext header octets]
               00 01 02 03  04 05 06 07  08 09 0A 0B  0C 0D 0E 0F
               10 11 12 13  14 15 16 17  18 19 1A 1B  1C 1D 1E
    CBC IV in: 59 00 00 00  03 02 01 00  A0 A1 A2 A3  A4 A5 00 17
    CBC IV out:EB 9D 55 47  73 09 55 AB  23 1E 0A 2D  FE 4B 90 D6
    After xor: EB 95 55 46  71 0A 51 AE  25 19 0A 2D  FE 4B 90 D6   [hdr]
    After AES: CD B6 41 1E  3C DC 9B 4F  5D 92 58 B6  9E E7 F0 91
    After xor: C5 BF 4B 15  30 D1 95 40  4D 83 4A A5  8A F2 E6 86   [msg]
    After AES: 9C 38 40 5E  A0 3C 1B C9  04 B5 8B 40  C7 6C A2 EB
    After xor: 84 21 5A 45  BC 21 05 C9  04 B5 8B 40  C7 6C A2 EB   [msg]
    After AES: 2D C6 97 E4  11 CA 83 A8  60 C2 C4 06  CC AA 54 2F
    CBC-MAC  : 2D C6 97 E4  11 CA 83 A8
    CTR Start: 01 00 00 00  03 02 01 00  A0 A1 A2 A3  A4 A5 00 01
    CTR[0001]: 50 85 9D 91  6D CB 6D DD  E0 77 C2 D1  D4 EC 9F 97
    CTR[0002]: 75 46 71 7A  C6 DE 9A FF  64 0C 9C 06  DE 6D 0D 8F
    CTR[MAC ]: 3A 2E 46 C8  EC 33 A5 48
    Total packet length = 39. [Authenticated and Encrypted Output]
               00 01 02 03  04 05 06 07  58 8C 97 9A  61 C6 63 D2
               F0 66 D0 C2  C0 F9 89 80  6D 5F 6B 61  DA C3 84 17
               E8 D1 2C FD  F9 26 E0

According to the Introduction and Implementation sections, the host should pre-calculate the A0 and B0 vectors. The A0 vector should be written to the IV fields of the context:

    IV0 [31:0]: 00000001 (Nonce + Flag)
    IV1 [31:0]: 00010203
    IV2 [31:0]: A3A2A1A0
    IV3 [31:0]: 0000A5A4 (Initial counter = 0)

The B0 vector and the AAD data length are provided by using the INSERT (from the token) instruction.

    Word 0 [31:0]: 00000059 (CBC IV0)
    Word 1 [31:0]: 00010203 (CBC IV1)
    Word 2 [31:0]: A3A2A1A0 (CBC IV2)
    Word 3 [31:0]: 1700A5A4 (CBC IV3)
    Word 4 [15:0]: 0800 (AAD length)

The cipher key is provided via the Key field of the context.
    Key 0 [31:0]: C3C2C1C0
    Key 1 [31:0]: C7C6C5C4
    Key 2 [31:0]: CBCAC9C8
    Key 3 [31:0]: CFCECDCC

The hash key is provided via the XCBC Key fields of the context.

    K1_0 [31:0]: C0C1C2C3
    K1_1 [31:0]: C4C5C6C7
    K1_2 [31:0]: C8C9CACB
    K1_3 [31:0]: CCCDCECF
    K2_0, K2_1, K2_2, K2_3: 00000000
    K3_0, K3_1, K3_2, K3_3: 00000000

The result TAG is stored in the result digest field and has the following byte order.

    Hash_Result_0 [31:0]: 2CD1E817
    Hash_Result_1 [31:0]: E026F9FD

C.3.3.6 ESP

Introduction

This section explains how to perform ESP inbound and outbound transforms using AES-CCM. AES-CCM is the first standard that defines a variable-size ICV as a must. For IPSec operations that use AES-CCM, ICV sizes of 8 and 16 bytes must be supported, and 12 bytes can be supported. Since the ICV size is a matter of configuring token instructions, all sizes are supported.

For both inbound and outbound cases, the following data is provided by the host:
• For the B0 vector, the Flag is calculated, concatenated with the Salt value, and provided in the token.
• The message length and AAD length are provided in the token.

Outbound Flow

The AES-CCM for ESP outbound is executed according to Table C-12.

Table C-12. ESP Outbound with AES-CCM

Instruction | Source of Data | Remove | Hash | Cipher | Output | Context
INSERT (Flag and Salt) | token | - | Flag and Salt | - | - | -
INSERT (IV) | context (IV1 offset) | - | IV | - | - | -
INSERT (Msg. length + AAD length) | token | - | Msg. length + AAD length | - | - | -
INSERT [a] (SPI) | context | - | ESP SPI | - | ESP SPI | -
INSERT [b] (Seq. num. high for ESN) | context | - | Seq. num. high | - | - | -
INSERT [a] (Seq. num. low) | context | - | Seq. num. low | - | Seq. num. low | -
INSERT (zero padding for AAD data) | instruction | - | Zeroes | - | - | -
INSERT (IV) | context (IV1 offset) | - | - | - | IV | -
REMOVE_RESULT (schedule removal of S0 from the output) | output buffer | - | - | - | - | -
INSERT (zeroes for S0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | -
DIRECTION (payload) | input | - | Payload | Payload | Payload | -
INSERT (IPSec crypto padding) | instruction | - | IPSec padding | IPSec padding | IPSec padding | -
INSERT (zero padding for cipher data) | instruction | - | Zeroes | - | - | -
Execution of REMOVE_RESULT instruction | output buffer | S0 | - | - | - | -
INSERT (ICV field) | context (hash result) | - | - | - | ICV | -
CONTEXT_ACCESS (update sequence number) | context | - | - | - | - | sequence number

a. For regular sequence numbering these instructions can be combined into one instruction.
b. Applicable in case of using the extended sequence numbering.

Outbound processing functions as follows:
1. The word containing the Flag and Salt value is taken from the token and provided to the hash engine.
2. The IV data is taken from the context and provided to the hash engine.
3. Finally, the message length and AAD length fields are taken from the token and provided to the hash engine to complete the (B0 + AAD length) vector.
4. The ESP header is taken from the context and hashed. In case of ESN, the upper sequence number is hashed only (it is not passed to the output), before the regular sequence number is hashed.
5. Zero padding is provided to the hash engine to pad the hash data to a hash block size.
6. The IV is inserted in the output stream.
7. An instruction is scheduled to remove S0 from the output buffer.
8. A block of zeros is inserted to create the S0 vector.
9. The payload and optional padding are encrypted and then passed to the output stream.
10.
Zero padding is inserted in the hash engine to pad the payload data to a hash block size.
11. The S0 block is removed from the output buffer.
12. The calculated ICV is inserted in the output stream.
13. The updated sequence number is written back to the context record in memory.

Inbound Flow

The AES-CCM for ESP inbound is implemented according to Table C-13.

Table C-13. ESP Inbound with AES-CCM

Instruction | Source of Data | Remove | Hash | Cipher | Output | Context
RETRIEVE (store ESP header in context) | input | - | - | - | - | ESP header
INSERT (Flag and Salt for B0) | token | - | Flag and Salt | - | - | -
RETRIEVE (store IV in context and pass to hash) | input | - | IV | - | - | IV (IV1 offset)
INSERT (message length field + AAD length field) | token | - | msg. length + AAD length | - | - | -
INSERT [a] (ESP SPI) | context | - | ESP SPI | - | - | -
INSERT [b] (Seq. num. high for ESN) | context | - | Seq. num. high | - | - | -
INSERT [a] (Seq. num. low) | context | - | Seq. num. low | - | - | -
INSERT (zero padding for AAD data) | instruction | - | Zeroes | - | - | -
REMOVE_RESULT (schedule removal of S0 from the output) | output buffer | - | - | - | - | -
INSERT (zeroes for S0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | -
DIRECTION (encrypted Payload) | input | - | Payload | Payload | Payload | -
INSERT (zero padding for cipher data) | instruction | - | Zeroes | - | - | -
Execution of REMOVE_RESULT instruction | output buffer | S0 | - | - | - | -
RETRIEVE (ICV field) | input | - | - | - | - | ICV
VERIFY_FIELDS (ICV, Padding, SPI, Seq.num) | context | - | - | - | - | -
CONTEXT_ACCESS (update sequence number and mask) | context | - | - | - | - | seq. number and mask

a. For regular sequence numbering these instructions can be combined into one instruction.
b. Applicable in case of using the extended sequence numbering.
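The host-side preparation that the AES-CCM flows above rely on (the Flags byte, the B0 and A0 vectors, and their 32-bit word layout) can be sketched as follows. This is a minimal illustration, not Tilera code: the function names are hypothetical, the field encodings follow Table C-5 and Table C-6, and the little-endian word packing is inferred from the RFC 3610 packet vector #1 values listed in "Packet Processor Examples". For ESP, the sketch assumes an 11-byte nonce (3-byte salt plus 8-byte IV) with L = 4, as in RFC 4309.

```python
import struct


def ccm_b0_flags(mac_len: int, len_field_size: int, has_aad: bool) -> int:
    """Flags byte of B0: bit 6 = Adata, bits 5:3 = encoded M, bits 2:0 = encoded L."""
    if mac_len not in (4, 6, 8, 10, 12, 14, 16):
        raise ValueError("M must be an even number of bytes from 4 to 16")
    if not 2 <= len_field_size <= 8:
        raise ValueError("L must be between 2 and 8 bytes")
    m_encoded = (mac_len - 2) // 2   # 4 -> 001, 8 -> 011, 16 -> 111
    l_encoded = len_field_size - 1   # 2 -> 001, 4 -> 011, 8 -> 111
    return (int(has_aad) << 6) | (m_encoded << 3) | l_encoded


def build_b0(nonce: bytes, mac_len: int, len_field_size: int,
             msg_len: int, has_aad: bool) -> bytes:
    """B0 = Flags | Nonce (Salt + IV) | message length (big-endian, L bytes)."""
    if len(nonce) != 15 - len_field_size:
        raise ValueError("nonce must be 15 - L bytes long")
    flags = ccm_b0_flags(mac_len, len_field_size, has_aad)
    return bytes([flags]) + nonce + msg_len.to_bytes(len_field_size, "big")


def build_a0(nonce: bytes, len_field_size: int) -> bytes:
    """A0 = Flags (encoded L only) | Nonce | counter initialized to zero."""
    if len(nonce) != 15 - len_field_size:
        raise ValueError("nonce must be 15 - L bytes long")
    return bytes([len_field_size - 1]) + nonce + bytes(len_field_size)


def to_le_words(block: bytes) -> list:
    """Pack a byte string into 32-bit words, first byte in bits [7:0] (assumed layout)."""
    return list(struct.unpack("<%dI" % (len(block) // 4), block))


if __name__ == "__main__":
    # RFC 3610 packet vector #1: 13-byte nonce, M = 8, L = 2, 23-byte message, AAD present.
    nonce = bytes.fromhex("00000003020100A0A1A2A3A4A5")
    b0 = build_b0(nonce, mac_len=8, len_field_size=2, msg_len=23, has_aad=True)
    a0 = build_a0(nonce, len_field_size=2)
    print("B0 words:", ["%08X" % w for w in to_le_words(b0)])
    print("A0 words:", ["%08X" % w for w in to_le_words(a0)])
```

For an ESP packet under the assumptions above, `build_b0(salt + iv, mac_len, 4, msg_len, True)[:4]` would give the Flag and Salt word that the token supplies, and the last four bytes would give the message length field.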
The inbound processing functions as described below:
1. The ESP header is retrieved from the input stream and stored in the context.
2. The word containing the Flag and Salt value is taken from the token and provided to the hash engine.
3. The IV data are retrieved from the input stream and stored in the context at the same time as they are passed to the hash engine.
4. Finally, the message length and AAD length fields are taken from the token and provided to the hash engine to complete the (B0 + AAD length) vector.
5. The ESP header is provided to the hash engine from the context. In case of ESN, the upper sequence number is hashed only, before the regular sequence number is hashed.
6. Zero padding is inserted in the hash engine to pad the hash data to the hash block size.
7. An instruction is scheduled to remove S0 from the output buffer.
8. A block of zeros is inserted to create the S0 vector.
9. The payload is decrypted, de-padded, and then passed to the output stream.
10. Zero padding is inserted in the hash engine to pad the payload data to a hash block size.
11. The S0 block is removed from the output buffer.
12. The ICV is retrieved from the input stream and stored in the context.
13. The ICV, padding, SPI, and sequence number are verified.
14. The updated sequence number and mask are written back to the context record in memory.

C.3.4 AES-GMAC/AES-GCM for Basic Operations and IPSec Protocols

C.3.4.1 Introduction

The packet processor supports AES-GCM as a basic operation and as part of ESP. The packet processor supports AES-GMAC as a basic operation and as part of the ESP and AH protocols. For more information, refer to [RFC4106] and [RFC4543] in “Packet Processor References” on page 577.

The AES-GCM/GMAC operation in the packet processor is performed using the GHASH and AES-CTR sub-modules.
During processing, Y0 for TAG encryption is created by encrypting a block of zeros with the initial IV. The Y0 is removed later from the output stream.

C.3.4.2 Basic Operation

Introduction

This section explains how to perform basic inbound and outbound transforms using AES-GCM/GMAC.

Outbound Data Flow

The AES-GCM/GMAC outbound processing is executed according to Table C-14.

Table C-14. Basic AES-GCM/GMAC Processing Flow

Instruction | Source of Data | Remove | Hash | Cipher | Output | Context
DIRECTION (Bypass data) | input | - | - | - | Bypass data | -
DIRECTION (AAD data) | input | - | AAD data | - | AAD data | -
REMOVE_RESULT (schedule removal of Y0 from the output) | output buffer | - | - | - | - | -
INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | -
DIRECTION [a] (Payload) | input | - | Payload | Payload | Payload | -
Execution of REMOVE_RESULT instruction | output buffer | Y0 | - | - | - | -
INSERT (calculated TAG) | context (hash result) | - | - | - | TAG | -

a. This is only for AES-GCM. In case of AES-GMAC, there is no payload data. The whole packet is considered to be the AAD data.

The outbound processing functions as follows:
1. The bypass data are passed directly to the output stream.
2. The AAD data are inserted in the hash stream.
3. An instruction is scheduled to remove Y0 from the output buffer.
4. A block of zeros is inserted in the cipher stream to create the encrypted Y0.
5. The payload data are hashed, encrypted and then passed to the output stream.
6. The encrypted Y0 block is removed from the output buffer.
7. The result TAG is appended to the output stream.
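Step 4 above produces Y0 by pushing a block of zeros through the cipher stream ahead of the payload, so the zero block consumes the first counter value and the payload blocks use the following ones. A minimal sketch of the counter blocks involved, assuming the standard 96-bit IV layout of NIST SP 800-38D (12-byte IV followed by a 32-bit big-endian counter that starts at 1), is shown below. The function names are hypothetical and no AES operation is performed; how the engine stores these bytes in its IV words is not covered here.

```python
def gcm_counter_block(iv: bytes, counter: int) -> bytes:
    """Return the 16-byte counter block: 12-byte IV || 32-bit big-endian counter."""
    if len(iv) != 12:
        raise ValueError("this sketch only covers the 12-byte IV case")
    return iv + (counter & 0xFFFFFFFF).to_bytes(4, "big")


def gcm_counter_blocks(iv: bytes, payload_len: int):
    """Return (y0_block, payload_blocks) for a payload of payload_len bytes.

    y0_block is the counter block whose encryption masks the TAG; encrypting
    the inserted block of zeros under it leaves exactly that keystream block.
    payload_blocks are the counter blocks used for the payload itself.
    """
    n_blocks = -(-payload_len // 16)  # ceiling division to whole 16-byte blocks
    y0_block = gcm_counter_block(iv, 1)
    payload_blocks = [gcm_counter_block(iv, 2 + i) for i in range(n_blocks)]
    return y0_block, payload_blocks


if __name__ == "__main__":
    iv = bytes(range(12))  # placeholder IV, for illustration only
    y0, blocks = gcm_counter_blocks(iv, payload_len=40)
    print("Y0 counter block:", y0.hex())
    for block in blocks:
        print("payload counter block:", block.hex())
```

For AES-GMAC there is no payload, so under the same assumptions only the Y0 counter block is used.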
Inbound Data Flow

The AES-GCM/GMAC inbound processing is executed according to Table C-15.

Table C-15. Basic AES-GCM/GMAC Processing Flow

Instruction | Source of Data | Remove | Hash | Cipher | Output | Context
DIRECTION (pass bypass data to the output) | input | - | - | - | Bypass data | -
DIRECTION (AAD data) | input | - | AAD data | - | AAD data | -
REMOVE_RESULT (schedule removal of Y0 from the output) | output buffer | - | - | - | - | -
INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | -
DIRECTION [a] (Payload) | input | - | Payload | Payload | Payload | -
RETRIEVE (Store TAG in the context) | input | - | - | - | - | TAG result
Execution of REMOVE_RESULT instruction | output buffer | Y0 | - | - | - | -
VERIFY (calculated TAG) | context (hash result) | - | - | - | - | TAG

a. This is only for AES-GCM. In case of AES-GMAC, there is no payload data. The whole packet consists only of AAD data.

The inbound processing functions as follows:
1. The bypass data are passed directly to the output.
2. The AAD data are inserted into the hash stream.
3. An instruction is scheduled to remove Y0 from the output buffer.
4. A block of zeros is inserted in the cipher stream to create the encrypted Y0.
5. The payload data are hashed, decrypted, and then passed to the output stream.
6. The TAG is retrieved from the input stream and stored in the context.
7. The encrypted Y0 block is removed from the output buffer.
8. The calculated TAG is compared with the retrieved TAG in the context.

C.3.4.3 IPSec

Introduction

This section describes how to perform ESP transforms using AES-GCM/GMAC, and AH transforms using the AES-GMAC algorithm.
Context Control Words

For IPSec processing the context control words must be configured correctly. The layout and allowable settings of the control words are shown in the figures that follow.

[Figure: Context – Control Word 0, bits 31 to 00. Fields shown: MASK1, SPI, MASK0, reserved, SEQ, hash algorithm, digest type, reserved, crypto algorithm, key, ToP, packet-based options, context length.]

The applicable fields are listed in Table C-16.

Table C-16. IPSec with GMAC/GCM Context Control Word 0

| Field | Value | Description |
|---|---|---|
| MASK1, MASK0 | 00 | Outbound processing does not use mask. |
| | 01 | Inbound processing with 64-bit mask. |
| | 11 | Inbound processing with 128-bit mask. |
| SEQ | 01 | Use 32-bit packet number. |
| SPI | 1 | SPI value is used in processing. |
| hash algorithm | 100 | GHASH. |
| digest type | 10 | Use dedicated hash algorithm (GHASH). |
| crypto algorithm | 101 | AES-128. |
| | 110 | AES-192. |
| | 111 | AES-256. |
| key | 1 | The key is used in processing. |
| context length | * | See description in “context length” on page 437. |
| packet-based options | 0000 | Default value. |
| ToP (Type of Packet) | 0110 | Outbound encrypt-then-hash operation for AES-GCM. |
| | 1110 | Outbound hash-then-encrypt operation for AES-GMAC. |
| | 1111 | Inbound hash-then-decrypt operation for both GCM and GMAC. |

[Figure: Context – Control Word 1, bits 31 to 00. Fields shown: address mode, crypto-store, IV format, digest cnt, IV3, IV2, IV1, IV0, disable mask upd., seq. nbr. store, Feedback, crypto mode, pad type, enc. hash result, hash store, i-j-pntr, state selection, and reserved bits.]

The applicable fields are listed in Table C-17.

Table C-17. IPSec with GMAC/GCM Context Control Word 1

| Field | Value | Description |
|---|---|---|
| hash store | 0 | Do not save result digest into internal context register. |
| enc hash result | 1 | Use encrypted hash result. |
| pad type | 000 | No padding. |
| | 100 | IPSec padding. |
| | 111 | IPSec zero padding. |
| crypto store | 0 | Do not store result IV. |
| IV format | 01 | Counter mode. |
| IV3..IV0 | 0111 | 12-byte IV (AES), when IV is from the context. |
| crypto mode | 010 | AES-CTR with counter initialized to 1. |

ESP Outbound Flow

The ESP outbound flow with AES-GMAC is shown in Table C-18.

Table C-18. ESP Outbound with AES-GMAC

| N | Instruction | Source of Data | Destination: Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|---|
| 1 | INSERT (a) (ESP SPI) | context | - | ESP SPI | - | ESP SPI | - |
| 2 | INSERT (b) (Seq. num. high) | context | - | Seq. num. high | - | - | - |
| 3 | INSERT (a) (Seq. num. low) | context | - | Seq. num. low | - | Seq. num. low | - |
| 4 | INSERT (IV) | context (IV1 offset) | - | IV | - | IV | - |
| 5 | REMOVE_RESULT (schedule removal of Y0 from the output) | output buffer | - | - | - | - | - |
| 6 | INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | - |
| 7 | DIRECTION (Payload) | input | - | Payload | - | Payload | - |
| 8 | INSERT (b) (IPSec padding) | instruction | - | IPSec padding | - | IPSec padding | - |
| 9 | Execution of REMOVE_RESULT instruction | output buffer | Y0 | - | - | - | - |
| 10 | INSERT (ICV field) | context | - | - | - | ICV | - |
| 11 | CONTEXT_ACCESS (write out updated sequence number) | context | - | - | - | - | Sequence number field |

a., b. Applicable in case of using the extended sequence numbering. For regular sequence numbering these instructions can be combined into one instruction.
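The header rows of the ESP outbound flow can be illustrated with a small sketch. It assumes 32-bit big-endian fields and shows the one asymmetry the table describes: with extended sequence numbering the high half of the sequence number enters the hash stream but is never written to the output. The function name and the 64-bit `seq64` parameter are hypothetical conveniences, not part of the packet processor interface.

```python
import struct


def esp_gmac_header_streams(spi: int, seq64: int, iv: bytes, esn: bool):
    """Return (hash_stream_prefix, output_prefix) for the ESP header fields.

    Hypothetical helper: mirrors the SPI, sequence number and IV rows of the
    ESP outbound table, nothing more.
    """
    seq_hi = (seq64 >> 32) & 0xFFFFFFFF
    seq_lo = seq64 & 0xFFFFFFFF

    hashed = struct.pack("!I", spi)
    if esn:
        hashed += struct.pack("!I", seq_hi)  # hashed only, never transmitted
    hashed += struct.pack("!I", seq_lo) + iv

    output = struct.pack("!II", spi, seq_lo) + iv
    return hashed, output
```

With regular sequence numbering the two prefixes are identical, which is consistent with the table note that the SPI and sequence number insertions can then be combined into a single instruction.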
ESP Inbound Flow

The ESP inbound flow with AES-GMAC is shown in Table C-19 (regular sequence numbering).

Table C-19. ESP Inbound with AES-GMAC

| N | Instruction | Source of Data | Destination: Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|---|
| 1 | RETRIEVE (a) (hash and store ESP header) | input | - | ESP header | - | - | ESP header (SPI offset) |
| 2 | RETRIEVE (store IV in context) | input | - | IV | - | - | IV (IV1 offset) |
| 3 | REMOVE_RESULT (schedule removal of Y0 from the output) | output buffer | - | - | - | - | - |
| 4 | INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | - |
| 5 | DIRECTION (a) (encrypted Payload) | input | - | Payload | Payload | Payload | - |
| 6 | Execution of REMOVE_RESULT instruction | output buffer | Y0 | - | - | - | - |
| 7 | RETRIEVE (ICV field) | input | - | - | - | - | ICV |
| 8 | VERIFY_FIELDS (ICV, Padding, SPI, Seq. num) | context | - | - | - | - | - |
| 9 | CONTEXT_ACCESS (update sequence number and mask) | context | - | - | - | - | seq. number and mask |

a. In case of extended sequence numbering, the header is processed according to Table C-20.

When extended sequence numbering (ESN) is used, the ESP header is retrieved and hashed by using the instructions listed in Table C-20.

Table C-20. ESP Header Processing in case of ESN with AES-GMAC

| N | Instruction | Source of Data | Destination: Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|---|
| 1 | RETRIEVE (hash and store SPI) | input | - | ESP SPI | - | - | ESP SPI (SPI offset) |
| 2 | RETRIEVE (a) (store seq. num. low) | input | - | - | - | - | seq. num. low |
| 3 | INSERT (Estimated Seq. num. high) | context | - | Seq. num. high | - | - | - |
| 4 | INSERT (Seq. num. low) | context | - | Seq. num. low | - | - | - |

a. This instruction will trigger estimation of the upper sequence number.

AH Outbound Flow

The AH outbound processing is executed according to Table C-21.

Table C-21. AH Outbound with AES-GMAC

| N | Instruction | Source of Data | Destination: Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|---|
| 1 | REMOVE (Ethernet header) | input | Ethernet header | - | - | - | - |
| 2 | INSERT (Muted IP header) | token | - | Muted IP header | - | - | - |
| 3 | DIRECTION (non-muted IP header) | input | - | - | - | IP header | - |
| 4 | INSERT (1st word of AH header) | token | - | AH header 1st word | - | AH header 1st word | - |
| 5 | INSERT (SPI, Seq. num) | context | - | SPI, Seq. num | - | SPI, Seq. num | - |
| 6 | INSERT (IV) | context (IV1 offset) | - | IV | - | IV | - |
| 7 | REMOVE_RESULT (schedule removal of Y0 from the output) | output buffer | - | - | - | - | - |
| 8 | INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | - |
| 9 | INSERT (Zeroes as ICV field) | instruction | - | Zero ICV | - | Zero ICV | - |
| 10 | INSERT (a) (zeroes to pad AH header) | instruction | - | Zero padding | - | Zero padding | - |
| 11 | DIRECTION (Payload) | input | - | Payload | - | Payload | - |
| 12 | INSERT (b) (seq. num. high) | context | - | seq. num. high | - | - | - |
| 13 | Execution of REMOVE_RESULT instruction | output buffer | Y0 | - | - | - | - |
| 14 | REPLACE (c) (replace zero field with result ICV) | context | - | - | - | ICV | - |
| 15 | CONTEXT_ACCESS (update sequence number) | context | - | - | - | - | sequence number |

a. In case of IPv6, the length of the AH header should be a multiple of 64 bits, hence additional zero padding is necessary. This instruction can be combined with the previous instruction (insertion of zeroes to ICV place).
b. Hash the higher sequence number in case of extended sequence numbering.
c. When processing big packets, the result ICV is appended to the output packet so that the host processor can perform a replace operation.
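In the AH flows the muted IP header is supplied through the token, so the host has to prepare it. The sketch below shows one way a host might do that for IPv4, under the assumption that "muted" means the mutable header fields are zeroed before the ICV calculation, as in standard AH processing (type of service, flags and fragment offset, TTL, and header checksum). The function name is hypothetical, and headers with IP options are rejected to keep the example short.

```python
def mute_ipv4_header(header: bytes) -> bytes:
    """Return a copy of a 20-byte IPv4 header with its mutable fields zeroed.

    Hypothetical host-side helper; assumes no IP options are present.
    """
    if len(header) < 20:
        raise ValueError("IPv4 header must be at least 20 bytes")
    version = header[0] >> 4
    ihl_bytes = (header[0] & 0x0F) * 4
    if version != 4 or ihl_bytes != 20:
        raise ValueError("only option-free IPv4 headers are handled here")

    muted = bytearray(header[:20])
    muted[1] = 0                 # type of service (DSCP/ECN)
    muted[6:8] = b"\x00\x00"     # flags and fragment offset
    muted[8] = 0                 # time to live
    muted[10:12] = b"\x00\x00"   # header checksum
    return bytes(muted)
```

The total length and protocol fields are left untouched; the caller is assumed to have already set them to the values of the AH-protected packet.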
AH Inbound Flow

The AH inbound processing is executed according to Table C-22.

Table C-22. AH Inbound with AES-GMAC

| N | Instruction | Source of Data | Destination: Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|---|
| 1 | REMOVE (Ethernet header) | input | Ethernet header | - | - | - | - |
| 2 | INSERT (Muted IP header) | token | - | Muted IP header | - | - | - |
| 3 | DIRECTION (IP header) | input | - | - | - | IP header | - |
| 4 | RETRIEVE (1st word of AH header) | input | - | AH header 1st word | - | - | AH header 1st word |
| 5 | RETRIEVE (SPI, Seq. num) | input | - | SPI, Seq. num | - | - | SPI, Seq. num |
| 6 | RETRIEVE (IV) | input | - | IV | - | - | IV (IV1 offset) |
| 7 | REMOVE_RESULT (schedule removal of Y0 from the output) | output buffer | - | - | - | - | - |
| 8 | INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | - |
| 9 | RETRIEVE (store ICV field in context) | input | - | ICV | - | - | ICV offset |
| 10 | DIRECTION (a) (zeroes to pad AH header) | input | - | Zero padding | - | - | - |
| 11 | INSERT (Zeroes as ICV field) | instruction | - | Zero ICV | - | - | - |
| 12 | DIRECTION (Payload – last AAD data) | input | - | Payload | - | Payload | - |
| 13 | INSERT (b) (seq. num. high) | context | - | seq. num. high | - | - | - |
| 14 | Execution of REMOVE_RESULT instruction | output buffer | Y0 | - | - | - | - |
| 15 | VERIFY_FIELDS (ICV, SPI, Seq. num) | context | - | - | - | - | - |
| 16 | CONTEXT_ACCESS (update sequence number and mask) | context | - | - | - | - | seq. num and mask |

a. In case of IPv6, the length of the AH header should be a multiple of 64 bits, hence additional zero padding is necessary for the ICV calculation and it is part of the input packet.
b. Hash the high sequence number when extended sequence numbering is used.

C.3.5 SRTP/SRTCP Protocols

C.3.5.1 Introduction

The packet processor supports basic acceleration of SRTP/SRTCP protocols. Basic acceleration does not support full header processing.
Therefore, the header should be generated by the host processor and included in the packet. The summary of supported features is listed in Table C-23.

Table C-23. SRTP/SRTCP Functionality

| Functionality | Inbound | Outbound |
|---|---|---|
| IP header | Modification | Modification |
| UDP header | Bypass | Bypass |
| SRTP/SRTCP header processing | Bypass | Bypass |
| IV processing | From context | From context |
| MKI field (can be optional) | Removal, Verification | Insertion (from SPI) |
| SRTP ROC | From Token | From Token |
| SRTCP E+Index | Removal | Insertion (from Token) |
| TAG (variable length) | Verification | Insertion |
| Cipher algorithm | Null-crypto, AES-ICM | Null-crypto, AES-ICM |
| Hash algorithm | HMAC SHA1 | HMAC SHA1 |

C.3.5.2 Packet Format

The SRTP packet format is shown in Table C-24.

Table C-24. SRTP Packet Format

| SRTP packet fields, in order (32-bit rows, bits 0 to 31) |
|---|
| V=2, P, X, CC, M, PT, Sequence Number |
| Timestamp |
| Synchronization Source (SSRC) Identifier |
| Contributing Source (CSRC) Identifiers |
| RTP Extension (Optional) |
| Payload |
| RTP Padding, RTP Pad Count |
| SRTP MKI (Optional) |
| Authentication Tag (Recommended) |

The Master Key Identifier (MKI) field is used by key management. The MKI field identifies the master key from which the session key(s) were derived that authenticate and/or encrypt the particular packet. Note that the MKI must not identify the SRTP cryptographic context. The MKI can be used by key management for the purposes of re-keying and for identifying a particular master key within the cryptographic context.

The TAG field is used to carry message authentication data. The Authenticated Portion of an SRTP packet consists of the RTP header followed by the encrypted portion of the SRTP packet. Thus, if both encryption and authentication are applied, encryption must be applied before authentication on the sender side and conversely on the receiver side.
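The relationship between the Authenticated Portion, the ROC, and the TAG can be sketched in software. The example assumes the standard SRTP construction, in which HMAC-SHA1 is computed over the authenticated portion followed by the 32-bit ROC and the result is truncated to the configured tag length (10 bytes is used as the default here). The function names are hypothetical, and the code is a host-side illustration rather than a description of the packet processor's internals.

```python
import hashlib
import hmac
import struct


def srtp_auth_tag(auth_key: bytes, authenticated_portion: bytes,
                  roc: int, tag_len: int = 10) -> bytes:
    """HMAC-SHA1 over (authenticated portion || ROC), truncated to tag_len bytes."""
    message = authenticated_portion + struct.pack("!I", roc)
    return hmac.new(auth_key, message, hashlib.sha1).digest()[:tag_len]


def srtp_verify_tag(auth_key: bytes, authenticated_portion: bytes,
                    roc: int, received_tag: bytes) -> bool:
    """Recompute the tag and compare it with the received one in constant time."""
    expected = srtp_auth_tag(auth_key, authenticated_portion, roc, len(received_tag))
    return hmac.compare_digest(expected, received_tag)
```

Note that the ROC contributes to the tag without being part of the packet, which matches the "From Token" entry for the SRTP ROC in Table C-23.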
The authentication tag provides authentication of the RTP header and payload, and it indirectly provides replay protection by authenticating the sequence number. Note that the MKI is not integrity protected as this does not provide any extra protection.

The SRTCP packet format is shown in Figure C-7.

[Figure C-7: SRTCP Packet Format. Fields, in order (32-bit rows, bits 0 to 31): V=2, P, RC, PT=SR or RR, Length; Sender SSRC; Sender info; Report block 1; Report block 2; …; V=2, P, SC, PT=SDES=202, Length; SSRC / CSRC_1; SDES Items; …; E, SRTCP Index; SRTCP MKI (Optional); Authentication tag.]

The bit E is set when the current SRTCP packet is encrypted. The SRTCP index is a 31-bit counter for the SRTCP packet. The index is explicitly included in each packet, in contrast to the “implicit” index approach used for SRTP. The counter must be cleared to zero before the first SRTCP packet is sent, and must be incremented by one, modulo 2^31, after each SRTCP packet is sent. In particular, after a re-key, this field must not be reset to zero again.

C.4 Context Control Words

For SRTP/SRTCP processing the context control words must be configured correctly. The layout and allowable settings of the control words are shown in the figures that follow.

[Figure: Context – Control Word 0, bits 31 to 00. Fields shown: MASK1, SPI, MASK0, reserved, SEQ, hash algorithm, digest type, reserved, crypto algorithm, key, ToP, packet-based options, context length.]

The applicable fields are listed in Table C-25.

Table C-25. SRTP/SRTCP Context Control Word 0

| Field | Value | Description |
|---|---|---|
| hash algorithm | 010 | SHA1 HMAC. |
| digest type | 11 | HMAC type of hash algorithm. |
| crypto algorithm | 101 | AES-128. |
| key | 1 | The key is used in processing for cipher algorithms. |
| | 0 | The key is not used for Null-Crypto Mode. |
| context length | * | See description in “context length” on page 437. |
| packet-based options | 0000 | Default value. |
| ToP (Type of Packet) | 0010 | Outbound hash operation (for Null-Crypto Mode). |
| | 0110 | Outbound encrypt-then-hash operation (for all other cipher algorithms). |
| | 0011 | Inbound hash operation (for Null-Crypto Mode). |
| | 1111 | Inbound hash-then-decrypt operation (for all other cipher algorithms). |

[Figure: Context – Control Word 1, bits 31 to 00. Fields shown: address mode, crypto-store, IV format, IV3, IV2, IV1, IV0, digest cnt, Feedback, crypto mode, disable mask upd., seq. nbr. store, state selection, i-j-pntr, hash store, enc. hash result, pad type, and reserved bits.]

The applicable fields are listed in Table C-26.

Table C-26. SRTP/SRTCP Context Control Word 1

| Field | Value | Description |
|---|---|---|
| hash store | 0 | Do not store result digest into internal context register. |
| pad type | 000 | Not used. |
| crypto store | 0 | Do not store result IV (IV is unique for every packet). |
| IV format | 00 | Use full IV mode for IV processing. |
| digest cnt | 0 | Digest counter is not used. |
| IV3..IV0 | 1111 | 16-byte IV (AES-ICM). |
| feedback mode | 00 | For AES. |
| crypto mode | 011 | ICM with 16-bit counter initialized to zero. The 16-bit counter rollover should be detected by the host. |

C.4.0.1 Outbound Processing

The updated UDP and RTP headers are provided as part of the input packet. The MKI field is taken from the SPI field of the context record. The outbound SRTP processing is executed according to Table C-27.
Table C-27. SRTP Outbound Processing

| N | Instruction | Source of Data | Destination: Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|---|
| 1 | IPV4_CKS or IPV6 (modify IP header) | input | - | - | - | IP header | - |
| 2 | DIRECTION (IP address) | input | - | - | - | IP address | - |
| 3 | DIRECTION (UDP header) | input | - | - | - | UDP header | - |
| 4 | DIRECTION (RTP header) | input | - | RTP header | - | RTP header | - |
| 5 | DIRECTION (Payload data) | input | - | Payload | Payload | Payload | - |
| 6 | INSERT (hash ROC value) | token | - | ROC | - | - | - |
| 7 | INSERT (optional MKI) | context (SPI field) | - | - | - | MKI | - |
| 8 | INSERT (TAG field) | context | - | - | - | TAG | - |

The outbound processing for SRTP functions as follows:
1. The IPv4 or IPv6 header is updated with the parameters of the result packet (result length).
2. The IP address is passed to the output.
3. The UDP header is passed to the output.
4. The RTP header is hashed and passed to the output.
5. Payload data are encrypted and then hashed.
6. The ROC value is inserted from the input token into the hash stream.
7. The MKI field is inserted from the SPI field of the context record.
8. The result TAG is appended to the end of the packet.

The outbound SRTCP processing is executed according to Table C-28.

Table C-28. SRTCP Outbound Processing

| N | Instruction | Source of Data | Destination: Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|---|
| 1 | IPV4_CKS or IPV6 (modify IP header) | input | - | - | - | IP header | - |
| 2 | DIRECTION (IP address) | input | - | - | - | IP address | - |
| 3 | DIRECTION (UDP header) | input | - | - | - | UDP header | - |
| 4 | DIRECTION (RTCP header) | input | - | SRTCP header | - | SRTCP header | - |
| 5 | DIRECTION (Payload data) | input | - | Payload | Payload | Payload | - |
| 6 | INSERT (E + Index value) | token | - | E + Index | - | E + Index | - |
| 7 | INSERT (optional MKI) | context (SPI field) | - | - | - | MKI | - |
| 8 | INSERT (TAG field) | context | - | - | - | TAG | - |

The outbound processing for SRTCP functions as follows:
1. The IPv4 or IPv6 header is updated with the parameters of the result packet (result length).
2. The IP address is passed to the output.
3. The UDP header is passed to the output.
4. The SRTCP header is hashed and passed to the output.
5. Payload data are encrypted and then hashed.
6. The E bit and Index value are inserted from the input token into the hash stream and to the output.
7. The MKI field is inserted from the SPI field of the context record.
8. The result TAG is appended to the end of the packet.

C.4.0.2 Inbound Processing

The UDP and RTP headers are not modified by the packet processor during the inbound transform. The inbound SRTP processing is executed according to Table C-29.

Table C-29. SRTP Inbound Processing

| N | Instruction | Source of Data | Destination: Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|---|
| 1 | IPV4_CKS or IPV6 (modify IP header) | input | - | - | - | IP header | - |
| 2 | DIRECTION (IP address) | input | - | - | - | IP address | - |
| 3 | DIRECTION (UDP header) | input | - | - | - | UDP header | - |
| 4 | DIRECTION (RTP header) | input | - | RTP header | - | RTP header | - |
| 5 | DIRECTION (Payload data) | input | - | Payload | Payload | Payload | - |
| 6 | INSERT (hash ROC value) | token | - | ROC | - | - | - |
| 7 | RETRIEVE (optional MKI) | input (SPI result) | - | - | - | - | MKI |
| 8 | RETRIEVE (store TAG in the context) | input | - | - | - | - | TAG |
| 9 | VERIFY (calculated TAG) | context (hash result) | - | - | - | - | TAG |

The inbound processing for SRTP functions as follows:
1. The IPv4 or IPv6 header is updated with the parameters of the result packet (result length).
2. The IP address is passed to the output.
3. The UDP header is passed to the output.
4. The RTP header is hashed and passed to the output.
5. Payload data are hashed and at the same time decrypted. The decrypted payload is sent to the output.
6. The ROC value is inserted from the input token into the hash stream.
7. The MKI is retrieved from the input and stored in the SPI field of the context record.
8. The TAG is retrieved from the input and stored in the result digest field.
9. The calculated TAG is compared with the retrieved TAG in the context.

The inbound SRTCP processing is executed according to Table C-30.

Table C-30. SRTCP Inbound Processing

| N | Instruction | Source of Data | Destination: Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|---|
| 1 | IPV4_CKS or IPV6 (modify IP header) | input | - | - | - | IP header | - |
| 2 | DIRECTION (IP address) | input | - | - | - | IP address | - |
| 3 | DIRECTION (UDP header) | input | - | - | - | UDP header | - |
| 4 | DIRECTION (RTP header) | input | - | RTP header | - | RTP header | - |
| 5 | DIRECTION (Payload data) | input | - | Payload | Payload | Payload | - |
| 6 | INSERT (hash E + Index value) | token | - | E + Index | - | - | - |
| 7 | RETRIEVE (optional MKI) | input (SPI result) | - | - | - | - | MKI |
| 8 | RETRIEVE (store TAG in the context) | input | - | - | - | - | TAG |
| 9 | VERIFY (calculated TAG) | context (hash result) | - | - | - | - | TAG |

The inbound processing for SRTCP functions as follows:
1. The IPv4 or IPv6 header is updated with the parameters of the result packet (result length).
2. The IP address is passed to the output.
3. The UDP header is passed to the output.
4. The SRTCP header is hashed and passed to the output.
5. Payload data are hashed and at the same time decrypted. The decrypted payload is sent to the output.
6. The E bit and Index are inserted into the hash stream and removed from the packet.
7. The MKI is retrieved from the input and stored in the SPI field of the context record.
8. The TAG is retrieved from the input and stored in the result digest field.
9. The calculated TAG is compared with the retrieved TAG in the context.

C.4.1 MACsec Protocol

C.4.1.1 Introduction

The packet processor supports inbound and outbound packet processing for MACsec. Support of MACsec is implemented according to Table C-31.

Table C-31. MACsec Functionality

| Functionality | Inbound | Outbound |
|---|---|---|
| Header processing | Removal | Insertion: STI from token, PN and SCI from context |
| IV processing | From input header (with SCI); from input header and context (without SCI) | From context |
| Packet number | Verification | Generation. Overflow check. |
| ICV (16-byte) | Verification | Insertion |
| Confidentiality offset | Supported | Supported |
| Cipher suites | Integrity and confidentiality (AES-GCM); integrity only (AES-GMAC) | Integrity and confidentiality (AES-GCM); integrity only (AES-GMAC) |

C.4.1.2 Packet Format

The format of the MACsec packet is shown in Figure C-8.

[Figure C-8: MACsec Packet Format. MAC Protected Data Unit (MPDU): macDA (6 bytes), macSA (6 bytes), SecTAG (16 bytes; 8 bytes without the optional SCI), Secure/User data (MSDU), ICV (16 bytes only). SecTAG contents: SecTAG Information (STI, 4 bytes), consisting of ET (2 bytes, MACsec EtherType 88-E5), TAG Control Information (TCI, 1 byte: V = version number, 0; ES = end station; SC = SCI included in SecTAG; SCB = Single Copy Broadcast (EPON); E = encryption; C = changed text; AN = association number, 2 bits), and SL (1 byte; <48: data length, 0: length is provided by LMI); PN = packet number (4 bytes); SCI = Secure Channel Identifier (8 bytes, optional).]

Protection of the MACsec packet is shown in Figure C-9.

[Figure C-9: MACsec Protection. Frame fields shown: macDA, macSA, SecTAG, Payload, ICV, with the Integrity and Confidentiality spans and the confidentiality offset marked.]

The way data are processed by the crypto engine is shown in Figure C-10.
[Figure C-10: AES-GCM/GMAC-128 Data Flow. The figure shows the input frame and output frame (macDA, macSA, SecTAG, Secure/User data, ICV) with the integrity span, the confidentiality span, the confidentiality offset, and the AAD length. GCM-AES-128 inputs and outputs: K (secret key, 128 b) = SAK K[127:0]; IV (Initialization Vector, 96 b) with IV[95:32] = SCI and IV[31:0] = PN; A (additional authentication data); P (plain text), selected when byte count > AAD length AND NOT (integrity only); C (cipher data); T (128 b). For the inbound flow, T is compared with the inbound frame ICV to produce ICV VALID. Note: SecTAG has reversed field order.]

C.4.1.3 Context Control Words

MACsec processing requires the control words to be configured correctly. The layout and allowable settings of the control words are shown in the figures below.

[Figure: Context – Control Word 0, bits 31 to 00. Fields shown: MASK1, SPI, MASK0, reserved, SEQ, hash algorithm, digest type, reserved, crypto algorithm, key, ToP, packet-based options, context length.]

The applicable fields are listed in Table C-32.

Table C-32. MACsec Context Control Word 0

| Field | Value | Description |
|---|---|---|
| MASK1, MASK0 | 00 | Outbound processing does not use mask. |
| | 01 | Inbound processing with 64-bit mask. |
| | 10 | Inbound processing with 32-bit mask. |
| | 11 | Inbound processing with 128-bit mask. |
| SEQ | 01 | Use 32-bit packet number. |
| SPI | 0 | SPI field is not used in processing. |
| hash algorithm | 100 | GHASH. |
| digest type | 10 | Use dedicated hash algorithm (GHASH). |
| crypto algorithm | 101 | AES-128. |
| key | 1 | The key is used in processing. |
| context length | * | See description in “context length” on page 437. |
| packet-based options | 0000 | Default value. |
| ToP (Type of Packet) | 0110 | Encrypt-then-hash for outbound operation (AES-GCM). |
| | 1110 | Hash-then-encrypt for outbound operation (AES-GMAC). |
| | 1111 | Hash-then-decrypt for inbound operation (AES-GCM/GMAC). |

[Figure: Context – Control Word 1, bits 31 to 00. Fields shown: address mode, crypto-store, IV format, digest cnt, IV3, IV2, IV1, IV0, disable mask upd., seq. nbr. store, Feedback, crypto mode, pad type, enc. hash result, hash store, i-j-pntr, state selection, and reserved bits.]

The applicable fields are listed in Table C-33.

Table C-33. MACsec Context Control Word 1

| Field | Value | Description |
|---|---|---|
| disable mask update | 0 | For outbound and for inbound ‘out of order’ mode. |
| | 1 | For inbound, mask update is disabled (sliding window only). |
| hash store | 0 | Do not store result digest into internal context register. |
| enc hash result | 1 | Use encrypted hash result. |
| pad type | 000 | Padding is not used. |
| crypto store | 0 | Do not store result IV. |
| IV format | 00 | Full IV mode. |
| IV3..IV0 | 0111 | 12-byte IV (AES). |
| crypto mode | 010 | AES-CTR with counter initialized to 1. |

C.4.1.4 Outbound Processing

Input Data (Outbound)

In order to process outbound MACsec packets, the following data must be provided to the packet processor:
• 12-byte MAC address
• 16-byte SecTAG
  • 4-byte SecTAG information (STI), composed of the EtherType, TCI/AN, and SL fields
  • 4-byte packet number
  • 8-byte Secure Channel Identifier (SCI)
• User Data
• Confidentiality offset
• 16-byte AES-GCM secret key (SAK)
• 16-byte AES-GCM hash key

The STI field is inserted from the token. The layout of the inserted word is shown in Figure C-11.
[Figure C-11: MACsec SPI Field Layout. STI field from token, bits 31 to 00. Fields shown: MACsec EtherType Byte 2, MACsec EtherType Byte 1, AN, C, E, SCB, SC, ES, V, SL (optional).]

The 4-byte packet number is inserted from the incremented value of the context sequence number field. At the beginning of the session, the sequence number field in the context must be set to 0. This allows inserting values from 1 to 2^32 inclusive. When the increment operation results in overflow of the sequence number, the sequence number overflow error (E10) is generated (see Table D-8, “Error Codes,” on page 430). The SCI field, when present, is taken from the token.

Data Flow

The MACsec outbound processing with AES-GCM is executed according to Table C-34.

Table C-34. MACsec Outbound with AES-GCM

| N | Instruction | Source of Data | Destination: Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|---|
| 1 | DIRECTION (pass MAC address to the output) | input | - | MAC address | - | MAC address | - |
| 2 | INSERT (insert STI field) | token | - | STI | - | STI | - |
| 3 | INSERT_CTX (insert packet number) | context (seq_num_res) | - | PN | - | PN | (IV2 offset) |
| 4 | INSERT_CTX (a) (insert SCI) | token | - | SCI | - | SCI | (IV0 offset) |
| 5 | REMOVE_RESULT (schedule removal of Y0 from the output) | output buffer | - | - | - | - | - |
| 6 | INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | - |
| 7 | DIRECTION (user data – integrity only) | input | - | User Data int. only | - | User Data int. only | - |
| 8 | DIRECTION (user data – confidentiality and integrity) | input | - | User Data conf. and int. | User Data conf. and int. | User Data conf. and int. | - |
| 9 | Execution of REMOVE_RESULT instruction | output buffer | Y0 | - | - | - | - |
| 10 | INSERT (ICV field) | context | - | - | - | ICV | - |
| 11 | CONTEXT_ACCESS (write updated packet number) | context (seq. num.) | - | - | - | - | Packet Number |

a. This instruction is necessary when the SCI field is present in the MACsec packet.

The outbound processing functions as follows:
1. The MAC address is passed directly to the output stream.
2. The STI field is inserted from the token.
3. The packet number is inserted from the context (incremented value) and is also used as part of the IV.
4. When the SCI field is present in the stream, the SCI field is inserted from the input token into the output stream, hashed, and used as part of the IV.
5. An instruction is scheduled to remove Y0 from the output buffer.
6. A block of zeroes is inserted into the cipher stream to create the encrypted Y0.
7. The payload data that are protected only by integrity are hashed and passed to the output.
8. The payload data that are protected by confidentiality and integrity are encrypted, hashed, and then passed to the output stream.
9. The encrypted Y0 block is removed from the output buffer.
10. The result ICV is appended to the output stream.
11. The incremented packet number is written back to the context memory.

C.4.1.5 Inbound Processing

Input Data (Inbound)

In order to process an inbound MACsec packet, the following data must be provided to the packet processor:
• Inbound MACsec packet
• Confidentiality offset
• 16-byte AES-GCM secret key
• 16-byte AES-GCM hash key

The verification of the packet number is done as follows. The packet processor supports out-of-order packet numbers within the specified window. This functionality is implemented using the IPSec replay protection logic, except that the sequence mask is not updated. By programming the sequence number mask field, it is possible to specify any window from 1 to 128.
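A sketch of how a host might compute the value to program into the mask fields for a given window size follows. It uses the shift-based rule this section gives for the mask: N one-bits shifted left by the window size and truncated to N bits, where N is taken here to be the mask width (32, 64, or 128). The function names are hypothetical, and the low-word-first ordering of the 32-bit mask fields is an assumption that should be checked against the context record layout.

```python
def replay_window_mask(mask_bits: int, window: int) -> int:
    """Return {mask_bits ones} << window, truncated to mask_bits bits."""
    if mask_bits not in (32, 64, 128):
        raise ValueError("mask width must be 32, 64 or 128 bits")
    if not 1 <= window <= mask_bits:
        raise ValueError("window must be between 1 and the mask width")
    all_ones = (1 << mask_bits) - 1
    return (all_ones << window) & all_ones


def mask_to_words(mask: int, mask_bits: int) -> list:
    """Split the mask into 32-bit words, low-order word first (assumed order)."""
    return [(mask >> (32 * i)) & 0xFFFFFFFF for i in range(mask_bits // 32)]
```

For example, a window of 8 with a 64-bit mask clears the low 8 bits and leaves the upper 56 bits set, while a window equal to the full mask width yields an all-zero mask.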
The four mask fields of the context should be programmed with a number based on this formula:

    Mask[N-1:0] = {N{1'b1}} << "window size"

Note: According to the [MACsec] standard, received packets with a packet number equal to zero are not allowed. This check must be done by the host before providing packets to the packet processor.

Data Flow

The MACsec inbound processing with AES-GCM is executed according to Table C-35.

Table C-35. MACsec Inbound with AES-GCM

| N | Instruction | Source of Data | Destination: Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|---|
| 1 | DIRECTION (pass MAC address to the output) | input | - | MAC address | - | MAC address | - |
| 2 | DIRECTION (hash and remove STI field) | input | - | STI | - | - | - |
| 3 | RETRIEVE (retrieve and hash packet number) | input | - | PN | - | - | PN (result seq. num.) |
| 4 | RETRIEVE (a) (retrieve SCI) | input | - | SCI | - | - | (IV0 offset) |
| 5 | INSERT_CTX (use packet number as part of IV) | context (result seq. num.) | - | - | - | - | (IV2 offset) |
| 6 | REMOVE_RESULT (schedule removal of Y0 from the output) | - | - | - | - | - | - |
| 7 | INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | - |
| 8 | DIRECTION (user data – integrity only) | input | - | User Data int. only | - | User Data int. only | - |
| 9 | DIRECTION (user data – confidentiality and integrity) | input | - | User Data conf. and int. | User Data conf. and int. | User Data conf. and int. | - |
| 10 | Execution of REMOVE_RESULT instruction | output buffer | Y0 | - | - | - | - |
| 11 | RETRIEVE (store ICV field in the context) | input | - | - | - | - | hash result offset |
| 12 | VERIFY (calculated ICV and PN) | context | - | - | - | - | - |
| 13 | CONTEXT_ACCESS (a) (update packet number) | context | - | - | - | - | packet number |

a. This instruction is necessary when the SCI field is present in the MACsec packet.

The inbound processing functions as follows:
1. The MAC address is passed directly to the output stream.
2. The STI field is hashed and removed from the packet.
3. The packet number is retrieved from the input, hashed, and stored in the context for a later check.
4. When the SCI field is present in the stream, the SCI field is retrieved from the input, hashed, and used as part of the IV. Otherwise this instruction should be a NOP instruction, because the context register is read immediately after writing.
5. The packet number is copied to the IV2 field of the context.
6. An instruction is scheduled to remove Y0 from the output buffer.
7. A block of zeroes is inserted into the cipher stream to create the encrypted Y0.
8. The payload data that are protected only by integrity are hashed and passed to the output.
9. The payload data that are protected by confidentiality and integrity are decrypted, hashed, and then passed to the output stream.
10. The encrypted Y0 block is removed from the output buffer.
11. The packet ICV is retrieved from the input and stored in the context for comparison.
12. The calculated ICV is compared with the ICV from the packet. The packet number is checked according to Table D-41 on page 488.
13. The updated packet number is written back to the context memory.

Inbound Checks

The MACsec processing token can contain instructions for performing a number of inbound checks. The available checks are:
• Sequence number check: an out-of-window check without replay protection. An out-of-window packet will cause a sequence number failure error.
• ICV check: the ICV value calculated during inbound processing is compared with the ICV value received from the input stream. In case of a mismatch, the authentication failure error is generated.

C.4.2 DTLS Protocol

C.4.2.1 Introduction

The packet processor supports the DTLS protocol without length field processing.
This is because the packet processor is designed as a stream processor and does not have the ability to “look to the end of the packet” to decrypt the last two words. Therefore, in the case of block ciphers, and to allow single-pass processing, the external host must have corrected the length field before submitting the packet to the packet processor.

C.4.2.2 Supported Features

Support of DTLS processing is implemented according to Table C-36.

Table C-36. DTLS Functionality

Functionality | Inbound | Outbound
Header | Removal. Checking type and version, epoch. | Insertion.
Length field processing | By the host. | By the host.
IV processing | From input. | Insertion from context.
Sequence number | Verification with 64-bit or 128-bit mask. | Generation. Overflow check.
Fragment compression/decompression | Null. | Null.
MAC (a) | Verification. | Insertion.
Crypto padding | Removal and verification. | Insertion.
Cipher algorithms | Null-crypto, DES, 3DES, AES-CBC (128, 192, 256-bit key).
Hash | HMAC-MD5, HMAC-SHA1 (optional SHA2).

a. Message Authentication Code (MAC).

C.4.2.3 Packet Format

The format of the DTLS packet is shown in Figure C-12.

[Figure C-12: DTLS Packet Format. Record fields, in order:
  Type (1 byte): 20 - change_cipher_spec, 21 - alert, 22 - handshake, 23 - application_data
  Version (2 bytes): {254,255} – DTLS 1.0
  Epoch (2 bytes): cipher state change counter value
  Sequence Number (6 bytes): record sequence number
  Length (of fragment) (2 bytes): <= 2^14 for a plaintext fragment, <= 2^14 + 1024 for a compressed fragment, <= 2^14 + 2048 for an encrypted fragment
  Fragment: IV (block size), Payload, MAC (md5 - 16 bytes, sha - 20 bytes), Padding (0 <= L(pad) <= 255, value = L(pad)), L(pad) (1 byte)]

C.4.2.4 Context Control Words

DTLS processing requires the context control words to be configured correctly.
The layout and allowable settings of the control words are shown in the figures that follow.

Context – Control Word 0

[Figure: Context – Control Word 0 bit layout, bits 31 to 00. Fields shown: context length, packet-based options, ToP, key, crypto algorithm, digest type, hash algorithm, SEQ, MASK0, MASK1, SPI, and reserved bits.]

The applicable fields are listed in Table C-37.

Table C-37. DTLS Context Control Word 0

Field | Value | Description
MASK1, MASK0 | 00 | Outbound processing does not use mask.
MASK1, MASK0 | 01 | Inbound 64-bit mask.
MASK1, MASK0 | 11 | Inbound 128-bit mask.
SEQ | 10 | Use 48-bit sequence number.
SPI | 1 | SPI value is used in processing.
Hash Algorithm | 000 | MD5 HMAC.
Hash Algorithm | 010 | SHA1 HMAC.
Digest Type | 11 | HMAC type of hash algorithm is used.
Crypto Algorithm | * | Select applicable crypto algorithm (see Table D-10, “Control Word 0 Field Encoding,” on page 437).
Key | 1 | The Key is used in processing for cipher algorithms.
Key | 0 | The Key is not used for Null-Crypto Mode.
Context Length | * | See description in “context length” on page 437.
Packet Based Options | 0000 | Default value.
ToP (Type of Packet) | 0010 | Outbound hash operation (for Null-Crypto Mode).
ToP (Type of Packet) | 1110 | Outbound hash-encrypt operation (for all other cipher algorithms).
ToP (Type of Packet) | 0011 | Inbound hash operation (for Null-Crypto Mode).
ToP (Type of Packet) | 0111 | Inbound decrypt-hash operation (for all other cipher algorithms).

Context – Control Word 1

[Figure: Context – Control Word 1 bit layout, bits 31 to 00. Fields shown: state selection, i-j pntr, hash store, enc. hash result, pad type, crypto mode, seq. nbr. store, feedback, disable mask upd., IV0, IV1, IV2, IV3, digest cnt, IV format, crypto-store, address mode, and reserved bits.]

The applicable fields are listed in Table C-38.

Table C-38. DTLS Context Control Word 1

Field | Value | Description
Seq. Num. Store | 1 | Disable estimation of sequence number.
Hash Store | 1 | Store result digest into internal context register.
Pad Type | 101 | TLS Pad Type.
Pad Type | 110 | SSL Pad Type.
Pad Type | 000 | No padding for Null-crypto case.
Crypto Store | 1 | Store IV/ARC4 I-J pointer back in context.
IV Format | 00 | Use full IV mode for IV processing.
Digest Cnt. | 0 | Digest counter is not used.
IV3…IV0 | 0000 | No IV (for Null-Crypto Mode).
IV3…IV0 | 0011 | 8-byte IV (DES, 3DES).
IV3…IV0 | 1111 | 16-byte IV (AES).
Feedback Mode | * | See Appendix D on page 371.
Crypto Mode | * | See Appendix D on page 371.

C.4.2.5 Outbound Processing

Introduction

This section explains how DTLS outbound processing can be done in the packet processor. The DTLS outbound processing is shown in Figure C-13.

[Figure C-13: DTLS Outbound Processing. Outbound input: Type, Version, Epoch, Seq Num, Len(payload), Payload. Integrity: Epoch and Seq Num are moved before the hash; the hash runs over Epoch, Seq Num, Type, Version, Len(payload), and Payload, and the hashing result is inserted as the MAC. Confidentiality: the IV is prepended; Payload, MAC, Padding, and Len(pad) are encrypted, with padding appended after the MAC. Len(payload) is replaced by Len(frag) before transmit. Outbound output: Type, Version, Epoch, Seq Num, Len(frag), Fragment.]

The DTLS outbound is a hash-encrypt type of processing: the hash value is calculated first, and then part of the packet and the calculated hash value are encrypted.
Input Data (DTLS Outbound)

The following data must be provided by the host to the packet processor in order to perform outbound DTLS processing:
• Payload data
• Header data
• Type and version fields
• Epoch/Sequence number
• Length of the payload data
• Length of the result fragment
• Padding length
• Cipher and hash keys
• Packet IV for block cipher

The Type and Version fields are taken from the SPI field of the context record. The layout of the SPI field is shown below.

Context – SPI

Bits 31:24 | Bits 23:16 | Bits 15:08 | Bits 07:00
Type | Version Major | Version Minor | 0000 0000

The 2-byte epoch and 6-byte (48-bit) sequence number are taken from two sequence number fields of the context record. The epoch value is placed in the Sequence number 1 field as shown below.

Context – Sequence number 1

Bits 31:16 | Bits 15:00
Epoch | Sequence number[47:32]

In the case of context reuse, the sequence number is auto incremented. The overflow of the 48-bit sequence number results in a sequence number overflow error (E10) (see Table D-8, “Error Codes,” on page 430).

The three length fields must be pre-calculated and provided via the token fields:
• Length of the payload data – used for MAC calculation.
• Length of the padding data – used for insertion of padding data, so that the encrypted data has a length that is a multiple of the cipher block size. In case of Null-crypto, padding is not necessary.
• Length of the result fragment – the value is transmitted to the output as the fragment length and is calculated by the formula: fragment_len = len(IV) + len(payload) + len(MAC) + len(total_pad_sequence).

When a block cipher is used, the IV is taken from the IV field of the context record.
This IV is inserted as part of the fragment.

Data Flow

The DTLS outbound processing is executed according to Table C-39.

Table C-39. DTLS Outbound Processing
(Remove, Hash, Cipher, Output, and Context are the Destination columns.)

Instruction | Source of Data | Remove | Hash | Cipher | Output | Context
DIRECTION (Bypass data) | input | - | - | - | Bypass | -
INSERT (epoch and seq. number) | context | - | epoch and seq. num | - | - | -
INSERT (lower seq. number) | context | - | lower seq. num | - | - | -
INSERT (type and version) | context | - | type and version | - | type and version | -
INSERT (epoch and seq. number) | context | - | - | - | epoch and seq. num | -
INSERT (lower seq. number) | context | - | - | - | lower seq. num | -
INSERT (payload length) | token | - | payload length | - | - | -
INSERT (fragment length) | token | - | - | - | fragment length | -
INSERT (IV) (a) | context | - | - | - | IV | -
DIRECTION (payload) | input | - | payload | payload | payload | -
INSERT (MAC result) | context | - | - | MAC | MAC | -
INSERT (padding) | instruction | - | - | padding | padding | -
CONTEXT_ACCESS (update sequence number) | context | - | - | - | - | Sequence number (1)

a. In case of block cipher.

Outbound processing functions as follows:
1. Bypass data are directly copied to the output stream.
2. Epoch and sequence number are inserted in the output stream.
3. Type and version are provided at the same time to the hash engine and to the output stream.
4. Epoch and sequence number are provided to the hash engine.
5. Payload length field, provided by the host, is inserted in the hash engine.
6. Fragment length, provided by the host, is inserted in the output stream.
7. In case of block cipher, IV is inserted to the output stream.
8. Payload data is passed to the hash engine and to the crypto engine. The encrypted payload is passed to the output.
9. The calculated MAC value is provided to the crypto engine.
The encrypted MAC is passed to the output stream.
10. Optional padding data are provided to the crypto engine. The encrypted padding is passed to the output stream.
11. The incremented sequence number is written back to the context record in memory.

C.4.2.6 Inbound Processing

Introduction

The DTLS inbound processing is shown in Figure C-14.

[Figure C-14: DTLS Inbound Processing. Inbound input: Type, Version, Epoch, Seq Num, Len(frag), Fragment. Decrypt phase 1: decrypt the last two blocks and extract the last byte, Len(pad). Len(payload) = Len(frag) - Len(IV) - Len(MAC) - Len(pad) - 1. Decrypt phase 2: the fragment decrypts to IV, Payload, RX MAC, Padding, Len(pad). Hash: Epoch and Seq Num are moved before the hash; the hash runs over Epoch, Seq Num, Type, Version, Len(payload), and Payload to produce the MAC. Inbound output: Payload.]

The Decrypt phase 1 is used in the case of a block cipher and is executed by the host.

Input Data (DTLS Inbound)

The following data must be provided by the host to the packet processor in order to perform inbound processing:
• Incoming packet
• Header checking information
• Type, Version fields
• Epoch value
• 48-bit maximum received sequence number and current mask value
• Pre-calculated length of the payload data
• Cipher and hash keys
• Packet IV for block cipher

The Type and Version fields for checking are taken from the SPI field of the context record. The layout of the SPI field is the same as described in “Input Data (Outbound)” on page 299. The packet processor can check whether type and version match the values from the inbound packet. When a mismatch is detected, an SPI check error is generated.

The Epoch value and maximum received 48-bit sequence number are taken from two sequence number fields of the context record as described in “Input Data (Outbound)” on page 299.
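The inbound context holds the maximum received sequence number and a current mask value. As an illustration of how such a pair is commonly used for replay detection, the following is a minimal sketch of a generic sliding-window check. It is not a description of the packet processor's internal logic: the function name `replay_check`, the convention that mask bit 0 corresponds to the maximum received sequence number, and the default window of 64 are all assumptions made for this example.

```python
def replay_check(max_seq, mask, seq, window=64):
    """Generic sliding-window replay check (illustrative only).

    max_seq: highest sequence number accepted so far
    mask:    bitmap of accepted packets; bit 0 stands for max_seq (assumed convention)
    seq:     sequence number of the packet under test
    Returns (accepted, new_max_seq, new_mask).
    """
    full = (1 << window) - 1
    if seq > max_seq:
        # Packet is ahead of the window: slide the window forward.
        shift = seq - max_seq
        new_mask = 1 if shift >= window else ((mask << shift) | 1) & full
        return True, seq, new_mask
    offset = max_seq - seq
    if offset >= window:
        return False, max_seq, mask      # older than the window
    if mask & (1 << offset):
        return False, max_seq, mask      # already seen: replay
    return True, max_seq, mask | (1 << offset)
```

In a real flow the new state would be committed only after the MAC check succeeds, and the received epoch would be compared with the epoch in the context before the sequence number is considered.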
Two length fields must be pre-calculated by the host and provided to the packet processor via the token fields:
• The pad length – obtained after decrypting the last two blocks of data, when a block cipher algorithm is used. In the case of Null-crypto, padding is not necessary and decryption of the last two blocks is not required.
• Length of the payload data – used for MAC calculation. This length is calculated using the following formula: payload_len = len(fragment) – len(IV) – len(MAC) – len(total_pad_sequence).

The IV for the block cipher algorithm is taken from the IV field of the context record. This IV is retrieved from the inbound packet.

Data Flow

The DTLS inbound processing is executed according to Table C-40.

Table C-40. DTLS Inbound Processing
(Remove, Hash, Cipher, Output, and Context are the Destination columns.)

Instruction and Explanation | Source of Data | Remove | Hash | Cipher | Output | Context
DIRECTION (Bypass data) | input | - | - | - | Bypass | -
RETRIEVE (store Type/Version in the context) | input | - | - | - | - | Type/Version (SPI offset)
RETRIEVE (epoch/seq.num) | input | - | epoch/seq. num | - | - | epoch/seq. num
RETRIEVE (lower seq.num) | input | - | lower seq. num | - | - | lower seq. num
INSERT (insert stored type/version) | context | - | type/version | - | - | -
INSERT (payload length) | token | - | payload length | - | - | -
REMOVE (fragment length) | input | fragment length | - | - | - | -
RETRIEVE (IV) (a) | input | - | - | - | - | IV (IV offset)
REMOVE_RESULT (store position for removal of ICV from the output buffer) | - | - | - | - | - | -
DIRECTION (Payload) | input | - | Payload | Payload | Payload | -
DIRECTION (MAC + padding) (b) | input | - | - | MAC + padding | MAC + padding | -
Execution of REMOVE_RESULT instruction | output buffer | MAC | - | - | - | -
VERIFY_FIELDS (verify seq.num, SPI, padding, MAC) | - | - | - | - | - | -
CONTEXT_ACCESS (update sequence number, sequence mask in context record in memory) | context | - | - | - | - | Sequence number, mask

a. Only in case of block cipher.
b. For this instruction, the length of padding must be known at the beginning (see “Input Data (DTLS Inbound)” on page 310).

The inbound processing functions as follows:
1. The bypass data are directly copied to the output stream.
2. Type and version from the inbound packet are stored in the SPI field of the context.
3. Epoch and sequence number are retrieved from the input packet, stored in the context, and passed to the hash engine.
4. Stored type and version are provided to the hash engine.
5. Payload length field is taken from the token and passed to the hash engine.
6. Fragment length is removed from the input stream.
7. The IV is retrieved from the input stream and stored in the context record.
8. The position of the MAC field in the output stream is remembered.
9. The payload is decrypted and then provided to the hash engine and also to the output stream.
10. The MAC and padding are decrypted and then inserted into the output stream.
11. SPI, sequence number, and padding data are verified, and the calculated MAC is compared with the MAC retrieved from the input.
12. In case of successful processing, the result sequence number and mask are written back to the context record in memory.

Inbound Checks

The DTLS inbound token can contain instructions for performing a number of inbound checks. The available checks are:
• Type and version check – inbound Type and version fields are checked against the Type and version fields stored in the SPI field of the context. In the case of a mismatch, the SPI check failure error is generated.
• Sequence number check – the sequence number field consists of two parts: the epoch and the actual sequence number. During the sequence number check, the received epoch number is checked against the epoch number in the context. At the same time, the 48-bit sequence number is checked for a replay condition.
If either of these two checks fails, the sequence number check error is generated.
• Pad verification – for block ciphers, the sequence of padding bytes after decryption can be detected and removed. A wrong padding sequence causes a pad verification failure.
• MAC check – the MAC value calculated during inbound processing is compared with the MAC value received from the input stream. In case of a mismatch, the authentication failure error is generated.

C.4.3 SSL/TLS Protocol

C.4.3.1 Introduction

The packet processor supports SSL3.0/TLS1.0/TLS1.1/TLS1.2 protocols without length field processing. This is due to the fact that the packet processor is designed as a stream processor and does not have the ability to ‘look to the end of the packet’ to decrypt the last two words. Therefore, to allow single-pass inbound processing, in the case of block ciphers, the external host must:
• Decrypt the last two blocks of the packet
• Extract the padding information from the decrypted data
• Calculate the payload length field

For single-pass outbound processing, the host must pre-calculate the fragment length, based on the cipher and hash algorithms used.

C.4.3.2 Supported Features

Support of SSL/TLS processing is implemented according to Table C-41.

Table C-41. Functionality for Processing TLS/SSL Packets

Functionality | Inbound | Outbound
Header processing | Removal. Check type and version. | Insertion from context.
IV processing | From Fragment (TLS1.1/TLS1.2). From Context (SSL/TLS1.0). | From Context.
Sequence number | Overflow check. | Generation. Overflow check.
Fragment compression/decompression | Null. | Null.
MAC | Verification. | Insertion.
Crypto padding | Removal and verification. Pad length is checked by the host. | Insertion.
Cipher algorithms | Null-crypto, ARC4 with key length from 40 to 128 bits, DES, 3DES, AES-CBC (128, 192, 256-bit key).
Hash | MD5, SHA1 (SSL-MAC for SSL, HMAC for TLS).

C.4.3.3 Packet Format

The combined packet format for SSL/TLS is shown in Figure C-15.

[Figure C-15: SSL/TLS Packet Format. Record fields, in order:
  Type (1 byte): 20 - change_cipher_spec, 21 - alert, 22 - handshake, 23 - application_data
  Version (2 bytes): {3,0} – SSL 3.0, {3,1} – TLS 1.0, {3,2} – TLS 1.1
  Length (of fragment) (2 bytes): <= 2^14 for a plaintext fragment, <= 2^14 + 1024 for a compressed fragment, <= 2^14 + 2048 for an encrypted fragment
  Fragment: IV (TLS 1.1 only: block size), Payload, MAC (md5 – 16 bytes, sha – 20 bytes), Padding (for block ciphers only; SSL: 0 <= L(pad) < Block Size, TLS: 0 <= L(pad) <= 255; value = L(pad)), L(pad) (1 byte, for block ciphers only)]

The difference between SSL and TLS packet formats is that TLS1.1 and TLS1.2 packets contain an explicit IV field for block cipher algorithms.

C.4.3.4 Context Control Words

SSL/TLS processing requires the context control words to be configured properly. The layout and allowable settings of the control words are shown in the figure below.

SSL/TLS Context Control Word 0

[Figure: Context – Control Word 0 bit layout, bits 31 to 00. Fields shown: context length, packet-based options, ToP, key, crypto algorithm, digest type, hash algorithm, SEQ, MASK0, MASK1, SPI, and reserved bits.]

The applicable fields are listed in Table C-42.

Table C-42. SSL/TLS Context Control Word 0

Field | Value | Description
SEQ | 11 | Use 64-bit sequence number.
SPI | 1 | SPI value is used in processing.
Hash Algorithm | 000 | MD5 HMAC/SSL-MAC.
Hash Algorithm | 001 | SHA1 SSL-MAC.
Hash Algorithm | 010 | SHA1 HMAC (used for SSL-MAC).
Digest Type | 11 | HMAC type of hash algorithm is used.
Crypto Algorithm | * | Select applicable crypto algorithm (see Table D-10, “Control Word 0 Field Encoding,” on page 437).
Key | 1 | The Key is used in processing for cipher algorithms.
Key | 0 | The Key is not used for Null-Crypto Mode.
Context Length | * | See description in “context length” on page 437.
Packet Based Options | 0000 | Default value.
ToP (Type of Packet) | 0010 | Outbound hash operation (for Null-Crypto Mode).
ToP (Type of Packet) | 1110 | Outbound hash-then-encrypt operation (for all other cipher algorithms).
ToP (Type of Packet) | 0011 | Inbound hash operation (for Null-Crypto Mode).
ToP (Type of Packet) | 0111 | Inbound decrypt-then-hash operation (for all other cipher algorithms).

SSL/TLS Context Control Word 1

[Figure: Context – Control Word 1 bit layout, bits 31 to 00. Fields shown: state selection, i-j pntr, hash store, enc. hash result, pad type, crypto mode, seq. nbr. store, feedback, disable mask upd., IV0, IV1, IV2, IV3, digest cnt, IV format, crypto-store, address mode, and reserved bits.]

The applicable fields are listed in Table C-43.

Table C-43. SSL/TLS Context Control Word 1

Field | Value | Description
Seq. Num. Store | 1 | Disable estimation of sequence number.
State Selection | 0 | ARC4 stateless mode.
State Selection | 1 | ARC4 stateful mode.
I-J Pointer | 0 | ARC4 I-J Pointer is not used.
I-J Pointer | 1 | ARC4 I-J Pointer is available.
Hash Store | 1 | Store result digest into internal context register.
Pad Type | 101 | TLS Pad Type.
Pad Type | 110 | SSL Pad Type.
Pad Type | 000 | No padding for ARC4 and Null-crypto.
Crypto Store | 1 | Crypto state is saved for the next packet.
IV Format | 00 | Use full IV mode for IV processing.
Digest Cnt. | 0 | Digest counter is not used.
IV3..IV0 | 0000 | No IV (for Null-Crypto Mode, ARC4).
IV3..IV0 | 0011 | 8-byte IV (DES, 3DES).
IV3..IV0 | 1111 | 16-byte IV (AES).
Feedback Mode | * | See Appendix D on page 371.
Crypto Mode | * | See Appendix D on page 371.

C.4.3.5 SSL MAC

The SSL protocol uses the SSL-MAC authentication algorithm. The SSL-MAC is a two-stage authentication algorithm that uses the SHA1 or MD5 hash function.

For SSL-MAC, the following sequence of calculations must be done:

hash(MAC_KEY + pad_2 + hash(MAC_KEY + pad_1 + authenticated_message))

where:
• + Denotes concatenation
• pad_1 The character 0x36 repeated 48 times for MD5 or 40 times for SHA1
• pad_2 The character 0x5c repeated 48 times for MD5 or 40 times for SHA1
• hash Hashing algorithm derived from the cipher suite

Since pad_1 and pad_2 have different lengths for MD5 and SHA1, the packet processor handles SSL-MAC in two different ways:
• For MD5, the first inner hash block (MAC_KEY + pad_1) must be pre-calculated by the host and provided as the inner digest in the context record, in the same way as for HMAC processing. The first outer hash block (MAC_KEY + pad_2) must be pre-calculated by the host and provided as the outer digest in the context record.
• For SHA1, only the MAC_KEY is provided in the context record (inner digest field). The rest of the calculations are done internally.

C.4.3.6 Outbound Processing

Introduction

This section explains how SSL/TLS outbound processing can be done in the packet processor. The combined outbound processing for SSL/TLS is shown in Figure C-16.
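The SSL-MAC formula given in C.4.3.5 can be reproduced in software for cross-checking results. The following is a minimal host-side sketch using Python's standard hashlib module; the function name `ssl_mac` and its parameters are chosen for this example only and do not correspond to any packet processor interface.

```python
import hashlib

# Pad lengths as stated in the text: 48 bytes for MD5, 40 bytes for SHA1.
_PAD_LEN = {"md5": 48, "sha1": 40}


def ssl_mac(hash_name, mac_key, authenticated_message):
    """Compute hash(MAC_KEY + pad_2 + hash(MAC_KEY + pad_1 + message))."""
    pad_len = _PAD_LEN[hash_name]
    pad_1 = b"\x36" * pad_len
    pad_2 = b"\x5c" * pad_len
    # Inner stage: key, pad_1, then the authenticated data.
    inner = hashlib.new(hash_name, mac_key + pad_1 + authenticated_message).digest()
    # Outer stage: key, pad_2, then the inner digest.
    return hashlib.new(hash_name, mac_key + pad_2 + inner).digest()
```

Assuming the usual SSL 3.0 key sizes (16 bytes for MD5, 20 bytes for SHA1), MAC_KEY + pad_1 is 16 + 48 = 64 bytes for MD5, which is exactly one hash block, whereas for SHA1 it is 20 + 40 = 60 bytes. This is consistent with the MD5 case being expressible as pre-calculated inner and outer digests while the SHA1 case is not.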
[Figure C-16: SSL/TLS Outbound Processing. Outbound input: Type, Version, Len(payload), Payload. Hash: Seq Num is inserted before the hash; the hash runs over Seq Num, Type, Version (skipped for SSL), Len(payload), and Payload, and the hashing result is inserted as the MAC. Encrypt: IV (TLS 1.1, (3)DES and AES only), Payload, MAC, Padding, and Len(pad) (for block ciphers only), with padding appended after the MAC. The payload length is replaced with the fragment length before transmit. Outbound output: Type, Version, Len(frag), Fragment.]

The SSL/TLS outbound is a hash-encrypt type of processing, where the calculation of the hash value is done and then part of the packet is encrypted.

Input Data (SSL/TLS Outbound)

The following data must be provided by the host to the packet processor in order to perform outbound processing:
• Payload data
• Header data
• Type and version fields
• Length of the payload data
• Length of the result fragment
• 64-bit sequence number
• Cipher and hash keys
• Packet IV and padding length for block cipher

The Type and Version fields are taken from the SPI field of the context record. The layout of the SPI field is shown in the figure below.

Context – SPI

Bits 31:24 | Bits 23:16 | Bits 15:08 | Bits 07:00
Type | Version Major | Version Minor | 0000 0000

The 8-byte (64-bit) sequence number is taken from two sequence number fields of the context record (non-incremented). The internal sequence number counter is incremented for use in the next packet. The overflow of the 64-bit sequence number results in a sequence number overflow error (E10) (see Table D-8, “Error Codes,” on page 430).

The three length fields must be pre-calculated and provided via the token fields:
• Length of the payload data – used for MAC calculation.
• Length of the padding data – used for insertion of padding data, so that the encrypted data has a length that is a multiple of the cipher block size.
• Length of the result fragment – the value is transmitted to the output as the fragment length and is calculated as: len(fragment) = len(IV, for TLS1.1 and TLS1.2) + len(payload) + len(MAC) + len(pad) + 1.

Note: The expression len(pad)+1 is the length of the total pad sequence.

The IV for the block cipher is taken from the IV field of the context record. For the TLS1.1 and TLS1.2 protocols, this IV is inserted as part of the fragment.

Data Flow

The data flow and instructions for outbound SSL/TLS processing are shown in Table C-44.

Table C-44. Outbound SSL/TLS Processing Flow
(Remove, Hash, Cipher, Output, and Context are the Destination columns.)

Instruction | Source of Data | Remove | Hash | Cipher | Output | Context
DIRECTION (Bypass data) | input | - | - | - | Bypass | -
INSERT (seq. number) | context | - | seq. number | - | - | -
INSERT (type and version) (a) | context (SPI field) | - | type and version | - | type and version | -
INSERT (payload length) | token | - | payload length | - | - | -
INSERT (fragment length) | token | - | - | - | fragment length | -
INSERT (IV for block cipher) (b) | context | - | - | - | IV | -
DIRECTION (payload) | input | - | payload | payload | payload | -
INSERT (MAC result) | context | - | - | MAC | MAC | -
INSERT (padding only for block ciphers) | instruction | - | - | padding | padding | -
CONTEXT_ACCESS (update crypto state) (c) | context | - | - | - | - | crypto state
CONTEXT_ACCESS (update sequence number) | context | - | - | - | - | Sequence number

a. For SSL packets, the ‘Version’ field is not inserted into the hash stream, but is still transmitted. Therefore this instruction should be split into two insert instructions: one to hash and transmit the Type field and the other to transmit the Version field.
b. The IV is inserted only for TLS1.1 and TLS1.2 in the case of a block cipher algorithm.
c. For block ciphers and SSL/TLS1.0 protocols, this is the result IV; for the ARC4 algorithm this is the I-J Pointer and the ARC4 state.

Outbound processing functions as follows:
1. Bypass data are directly copied to the output stream.
2. Sequence number is inserted in the hash engine.
3. Type and version are provided at the same time to the hash engine and to the output (see note a in Table C-44).
4. Payload length field, provided by the host, is provided to the hash engine.
5. Fragment length, provided by the host, is provided to the output stream.
6. For TLS1.1/TLS1.2 and block cipher, IV is inserted in the output stream.
7. Payload data are passed into the hash engine and to the crypto engine. The encrypted payload is passed to the output stream.
8. The calculated MAC value is inserted in the crypto engine. The encrypted MAC is passed to the output stream.
9. Optional padding data are inserted in the crypto engine. The encrypted padding is passed to the output stream.
10. The result crypto state is updated so that it can be reused for the next packet.
11. The incremented sequence number is written back to the context record in memory.
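Because the host must supply the padding length and the fragment length in the token, it can be useful to see the arithmetic spelled out. The sketch below applies the fragment length formula from “Input Data (SSL/TLS Outbound)”. The function name `outbound_lengths`, its parameters, and the choice of the smallest valid padding are assumptions made for this example; the protocol also permits longer padding, and the token format itself is not modeled here.

```python
def outbound_lengths(payload_len, mac_len, block_size=None, explicit_iv=False):
    """Return (pad_len, fragment_len) for one outbound record.

    block_size:  cipher block size in bytes, or None for Null-crypto/ARC4
    explicit_iv: True when the IV travels in the fragment (TLS1.1/TLS1.2)
    """
    if block_size is None:
        # No padding and no pad-length byte for stream or null ciphers.
        return 0, payload_len + mac_len
    # Smallest pad_len such that payload + MAC + pad + 1 is a block multiple.
    pad_len = (-(payload_len + mac_len + 1)) % block_size
    iv_len = block_size if explicit_iv else 0
    # len(fragment) = len(IV) + len(payload) + len(MAC) + len(pad) + 1
    fragment_len = iv_len + payload_len + mac_len + pad_len + 1
    return pad_len, fragment_len


# Example: 100-byte payload, SHA1 MAC (20 bytes), AES (16-byte block), TLS1.1.
pad_len, fragment_len = outbound_lengths(100, 20, block_size=16, explicit_iv=True)
```

For the example values, 100 + 20 + 1 = 121 bytes must be rounded up to 128, so pad_len is 7 and the fragment is 16 + 128 = 144 bytes long.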
C.4.3.7 Inbound Processing

Introduction

[Figure C-17: SSL/TLS Inbound Processing. Inbound input: Type, Version, Len(frag), Fragment. Decrypt phase 1 (block cipher only): decrypt the last two blocks and extract the last byte, Len(pad). Decrypt phase 2: the fragment decrypts to IV (TLS 1.1, block ciphers only), Payload, RX MAC, Padding, Len(pad). Calculated length: Len(payload) = Len(frag) - Len(IV) – Len(MAC) - Len(pad) - 1. Hash: Seq Num is inserted before the hash from the context; the hash runs over Seq Num, Type, Version (skipped for SSL), Len(payload), and Payload to produce the MAC. Inbound output: Payload.]

The SSL/TLS inbound processing is a decrypt-hash type of processing, where decryption is done first and then part of the packet is hashed. Decryption phase 1 is done by the host and decryption phase 2 is done by the packet processor.

Input Data (SSL/TLS Inbound)

The following data must be provided by the host to the packet processor in order to perform inbound processing:
• Incoming packet
• Header checking information
• Type, Version fields
• 64-bit current sequence number
• Pre-calculated length of the payload data
• Cipher and hash keys
• Packet IV for block cipher

The Type and Version fields for checking are taken from the SPI field of the context record. The layout of the SPI field is the same as described in “Input Data (Outbound)” on page 299. The packet processor can check whether type and version match the values from the inbound packet. In case of a mismatch, an SPI check error is generated.

The current 8-byte (64-bit) sequence number is taken from two sequence number fields of the context record (non-incremented). The sequence number is incremented after successful processing of the packet. The overflow of the internal 64-bit sequence number results in a sequence number overflow error (see Table D-8 on page 430).
Two length fields must be pre-calculated by the host and provided via the token fields:
• The pad_len – obtained after decrypting the last two blocks of data for a block cipher algorithm. In the case of ARC4 and Null-crypto, padding is not necessary and decryption of the last two blocks is also not required.
• Length of the payload data – used for MAC calculation. This length is calculated using the following formula: len(payload) = len(fragment) – len(IV, for TLS1.1 and TLS1.2) – len(MAC) – len(total_pad_sequence).

The IV for the block cipher is taken from the IV field of the context record. For the TLS1.1 and TLS1.2 protocols, this IV is retrieved from the inbound packet.

Data Flow

The data flow and instructions for inbound SSL/TLS processing are shown in Table C-45.

Table C-45. Inbound SSL/TLS Processing Flow
(Remove, Hash, Cipher, Output, and Context are the Destination columns.)

Instruction and Explanation | Source of Data | Remove | Hash | Cipher | Output | Context
DIRECTION (Bypass data) | input | - | - | - | Bypass | -
RETRIEVE (store type/version in the context) | input | - | - | - | - | type/version (SPI offset)
INSERT (seq. number) | context | - | seq. number | - | - | -
INSERT (insert stored type/version) (a) | context | - | type/version | - | - | -
INSERT (payload length) | token | - | payload length | - | - | -
REMOVE (fragment length) | input | fragment length | - | - | - | -
RETRIEVE (IV) (b) | input | - | - | - | - | IV (IV offset)
REMOVE_RESULT (store position for removal of MAC from the output buffer) | - | - | - | - | - | -
DIRECTION (payload) | input | - | payload | payload | payload | -
DIRECTION (MAC + padding) (c) | input | - | - | MAC + padding | MAC + padding | -
Execution of REMOVE_RESULT instruction | output buffer | MAC | - | - | - | -
VERIFY_FIELDS (verify SPI, padding, MAC) | - | - | - | - | - | -
CONTEXT_ACCESS (update crypto state) (d) | context | - | - | - | - | crypto state
CONTEXT_ACCESS (update sequence number in context record in memory) | context | - | - | - | - | Sequence number

a. For SSL packets, the ‘Version’ field is not inserted in the hash stream, but is still received. Therefore this instruction should hash only the Type field.
b. The IV is retrieved from the input packet only for TLS1.1 and TLS1.2 in case of a block cipher.
c. For this instruction, the length of padding must be known up front (see “Input Data (SSL/TLS Outbound)” on page 317).
d. For block ciphers and SSL/TLS1.0 protocols, this is the result IV; for the ARC4 algorithm this is the I-J Pointer and the ARC4 state.

The inbound processing functions as follows:
1. The bypass data are directly copied to the output stream.
2. Type and version from the inbound packet are stored in the SPI field.
3. Sequence number is taken from the context and passed to the hash engine.
4. Stored type and version are provided to the hash engine.
5. Payload length field is taken from the token and passed to the hash engine.
6. Fragment length is removed from the input stream.
7. For TLS1.1 and TLS1.2, the IV is retrieved from the input stream and stored in the context record.
8. The position of the MAC field in the output stream is remembered.
9. The payload is decrypted and then provided to the hash engine and also to the output stream.
10. The MAC and padding are decrypted and then inserted into the output stream.
11. SPI, sequence number, and padding data are verified, and the calculated MAC is compared with the MAC retrieved from the input.
12. The result crypto state is updated so that it can be reused for the next packet.
13. In case of successful processing, the incremented sequence number is written back to the memory location of the sequence number field in the context record.

Inbound Checks

The SSL/TLS token can contain instructions for performing a number of inbound checks.
The available checks are:
• Type and version check – the inbound Type and version fields are checked against the Type and version fields stored in the SPI field of the context. In the case of a mismatch, the SPI check failure error is generated.
• Sequence number check – incrementing the internal sequence number can overflow the 64-bit counter. A counter overflow causes a sequence number check failure.
• Pad verification – for block ciphers, the sequence of padding bytes after decryption can be detected and removed. A wrong padding sequence causes a pad verification failure.
• MAC check – the MAC value calculated during inbound processing is compared with the MAC value received from the input stream. In the case of a mismatch, the authentication failure error is generated.

C.5 Public Key Accelerator (PKA)

C.5.1 PKA Firmware Architecture Overview

The main objective of the PKA firmware is to manage the farm engines and the AES core. Commands given via the PKI command interface are handed off to the farm engines; when a farm engine has finished its assigned command, the result is either copied back via the PKI command interface or used for the follow-up command on the same farm engine. In summary, the following PKA firmware functionality can be distinguished:
• Copy the given PKI commands to the applicable command/result caches in buffer RAM for processing,
• Select a command from the command caches to process,
• Copy and optionally decrypt the necessary PKA vector(s),
• Hand off the PKA vector(s) and command to a free farm engine,
• Copy the result from the ready farm engine to the applicable result cache and optionally the result PKA vector(s),
• Copy the result(s) to the PKI command/result interface.
Note: Some PKI commands are immediately processed without farm engine involvement.

Figure C-18 shows an overview of all tasks and triggers based on the required PKA functionality.

Figure C-18: PKA Task Overview (diagram of the CommandCopy (Ring0..3), CommandHandler (Pre-farm), Farm0..9, FarmHandler (Post-farm), ResultCopy (Ring0..3), Zeroize DMA Channel FIFO (vector copy), and Farm LNME reset and memory zeroize tasks, with their start, fork, and resume triggers: command available, DMA Channel ready/error interrupt, PKCP interrupt, and FarmReady interrupt)

The responsibilities of the tasks are defined as follows:
• CommandCopy
  These tasks (one for each command/result ring) copy the PKI commands from the CPU memory to the applicable command/result cache and start the CommandHandler task if needed.
• CommandHandler
  This task selects, based on the specified ring priority, a PKI command from the command caches to process. For the selected PKI command, the required PKA vectors are copied from the CPU memory and optionally decrypted. Depending on the PKI command, a farm engine is allocated and started with a PKA command derived from the PKI command. When no farm engine involvement is required, the result is copied to the applicable command/result cache and the optional result PKA vectors are copied to the CPU memory.
• Farm0..9
  These tasks represent the farm engines. A farm engine executes the off-loaded PKA command.
• FarmHandler
  This task is started when a farm engine is ready. The task selects the first farm engine that reported ready and copies the result and, optionally, the result PKA vectors. When the PKI command is completed, the farm engine is released, the optional result PKA vectors are copied to the CPU memory, and the result is copied to the applicable command/result cache. When the PKI command is not yet completed, the farm engine is loaded with the follow-up PKA command.
• ResultCopy
  These tasks (one for each command/result ring) copy the PKI results from the command/result cache to the CPU memory. The CommandHandler or FarmHandler starts the applicable task when necessary.
• Zeroize DMA channel FIFO (vector copy)
  This task zeroizes the DMA channel FIFO that was used during the PKA vector copy operation. The CommandHandler task starts this task when all PKA vectors have been copied for the PKI command.
• Farm LNME reset and memory zeroize
  This task initiates the LNME reset and zeroizes the farm engine memory. The FarmHandler task starts this task when no PKA command (a follow-up command or the initial command of the next PKI command) is available for the farm engine.

The zeroize functionality implements the Security Level 3 functionality described in “FIPS (140-3)” on page 579.

C.5.2 Command and Vector Copy and Zeroization

Figure C-19 and Figure C-20 show how the PKI commands/results and PKA vectors are copied between the CPU memory and the RAM areas of the PKA, and which task is responsible for each copy operation. Both figures also mark the important zeroize action points. Figure C-19 covers PKI commands that do not require a farm engine. Figure C-20 covers PKI commands that do need a farm engine, including the optional follow-up PKA command situation.
The zeroize functionality zeroizes the DMA channel FIFO after the copy operations, and all used memory after the PKI command is off-loaded to a farm engine or is completed. In the case of farm engine involvement, the LNME and farm memory are zeroized when the farm engine is not directly needed for a PKA command.

In general, the buffer RAM contains the ring configuration, the command/result caches, and the public PKA master controller scratchpad. The secure RAM cannot be accessed from outside (CPU) and contains the Key Decrypt Key management, the private PKA master controller scratchpad, and the data area for the CommandHandler and FarmHandler tasks. This data area is split into a pre-data area for the CommandHandler task and a post-data area for the FarmHandler task. The Farm RAM is the workspace of the farm engine and contains the PKA input and output vectors and scratchpad.

Figure C-19: No Farm Engine Command and Data Copy (diagram: commands, input vectors, output vectors, and results move between CPU Memory, Buffer RAM, and Secure RAM over DMA Ch0 in the Command Copy, Command Handle, and Result Copy steps; the DMA channel FIFO is zeroized by a separate task)

Figure C-20: Farm Engine-Related Command and Data Copy (diagram: commands, input, intermediate, and output vectors, registers, and results move between CPU Memory, Buffer RAM, Secure RAM, and Farm RAM over DMA Ch0, Ch1, and Ch2 in the Command Copy, Command Handle, Farm Ready, and Result Copy steps; zeroize points are the DMA channel FIFO (separate task), the pre-data area, and the post-data area (also DMA channel FIFO); the LNME is reset and the farm memory zeroized by a separate task when the farm is no longer used)

C.5.3 PKI Command Interface

This section describes the PKA command interface, which is based on up to four independent command descriptor rings, with a separate result descriptor ring for each command ring. Only a small part of this interface is built in hardware (the command and result counters with associated interrupt logic); most of the functionality is defined by the firmware running on the PKA master controller Sequencer. Some parts of this section go into internal details of operation that are not ‘visible’ outside the device. This is meant as a kind of ‘reality check’ to prevent defining things that cannot work within the hardware framework of the PKA.

C.5.4 Main PKI Command Interface

The main PKI command interface uses descriptor/result rings held in Host memory space. Descriptors do not contain any vector data; they contain pointers to vectors in Host address space.

C.5.4.1 Descriptor Ring Management

Descriptors are 32 bytes in size. Four separate command descriptor rings holding up to 64K (65536) descriptors each can be used, each accompanied by a result descriptor ring of the same size. Command and result descriptor rings can be co-located or placed at different (non-overlapping) locations in Host space.

If multiple rings are used, selection of the ring that supplies the next PKI command to execute is normally done using rotating priority. It is also possible to place Ring 0 at a higher priority than the remaining rings or to turn the rotating priority off (in which case Ring 0 gets the lowest priority and Ring 3 the highest priority).

It is recommended to use separate rings if large differences in execution times for commands are expected.
This prevents many results of short-running commands from being stalled behind one result of a long-running command. The reason is that most of the internal buffer RAM is used to buffer command descriptors from each command ring: no new commands can be loaded while the oldest command in this buffer has not yet completed.

Read/write pointers for the rings should be kept locally by the Host and the PKA master controller (the latter will use some words of buffer RAM to hold them, providing progress indication and a re-sync capability). No true ‘ownership’ bits are used in the descriptors. These are not necessary because the command counters can be used to determine whether new commands can be written. Result descriptors contain two ‘written zero’ bits that can be used (by a driver) for ownership indications but are mainly intended to prevent result interrupt race problems.

C.5.4.2 Descriptor Ring Control/Status Words

The Host must write the ring base addresses, size, and initial read and write pointers at the start of the buffer RAM (see Table C-46 below) before writing the PKA_RING_OPTIONS word that contains the ring option settings. The ring configuration settings (addresses, pointers, and option settings) are processed when the first command is given; they are copied into secure RAM before actual use and take effect from then onwards. Changing the ring configuration can only be done by cycling the PKA through a reset.

Table C-46 provides the layout of the 8K Byte buffer RAM. The separate control words are described in the next sections. The buffer RAM address space ranges from 0x00000 to 0x01FFF.

Table C-46. Buffer RAM Layout for PKI Interface
(Byte addresses are within buffer RAM.)

Byte Address         Control / Status Word Name  Description
0x00000              CMMD_RING_BASE_0            Base address for command ring 0
0x00004              Reserved, write zero        -
0x00008              RSLT_RING_BASE_0            Base address for result ring 0
0x0000C              Reserved, write zero        -
0x00010              CMMD_RING_BASE_1            Base address for command ring 1
0x00014              Reserved, write zero        -
0x00018              RSLT_RING_BASE_1            Base address for result ring 1
0x0001C              Reserved, write zero        -
0x00020              CMMD_RING_BASE_2            Base address for command ring 2
0x00024              Reserved, write zero        -
0x00028              RSLT_RING_BASE_2            Base address for result ring 2
0x0002C              Reserved, write zero        -
0x00030              CMMD_RING_BASE_3            Base address for command ring 3
0x00034              Reserved, write zero        -
0x00038              RSLT_RING_BASE_3            Base address for result ring 3
0x0003C              Reserved, write zero        -
0x00040              RING_SIZE_0                 Number and offset of descriptors in command and result rings 0
0x00044              RING_SIZE_1                 Number and offset of descriptors in command and result rings 1
0x00048              RING_SIZE_2                 Number and offset of descriptors in command and result rings 2
0x0004C              RING_SIZE_3                 Number and offset of descriptors in command and result rings 3
0x00050              RING_RW_PTRS_0              Read pointer of command ring 0, write pointer for result ring 0
0x00054              RING_RW_PTRS_1              Read pointer of command ring 1, write pointer for result ring 1
0x00058              RING_RW_PTRS_2              Read pointer of command ring 2, write pointer for result ring 2
0x0005C              RING_RW_PTRS_3              Read pointer of command ring 3, write pointer for result ring 3
0x00060 – 0x0006F    Reserved, write zero        -
0x00070              PKA_RING_OPTIONS            Main control word
0x00074              MASTER_FW_VERSION           Master firmware version information
0x00078 – 0x01FFF    Reserved, do not modify (includes command caches for rings 0 – 3)

Command Ring Base Address Control Words (CMMD_RING_BASE_0 … _3)

CMMD_RING_BASE_0 (Read/Write), 18-bit Address in Host Target Window: 0x00000
CMMD_RING_BASE_1 (Read/Write), 18-bit Address in Host Target Window: 0x00010
CMMD_RING_BASE_2 (Read/Write), 18-bit Address in Host Target Window: 0x00020
CMMD_RING_BASE_3 (Read/Write), 18-bit Address in Host Target Window: 0x00030

Table C-47. Command Ring Base Address Control Words Bit Descriptions

Bits    Name                       Type  Function
[31:0]  Command ring base address  R/W   This is the base address of one command ring in Host address space. For performance reasons, it is suggested to align the base address to an 8-byte boundary, but this is not an absolute requirement. A command ring can be co-located with the accompanying result ring, in which case their base addresses must be identical. When not co-located, the two rings must not overlap.
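The placement rule for command and result rings can be checked on the Host before the base addresses are written. The following Python sketch is illustrative only: the function names are hypothetical, the buffer RAM offsets are taken from Table C-46, and the overlap test treats each ring as one contiguous byte range from its first to its last descriptor, which is an assumption rather than a documented rule.

```python
DESCRIPTOR_BYTES = 32  # descriptor size given in Descriptor Ring Management


def _check_ring(ring):
    if not 0 <= ring <= 3:
        raise ValueError("ring must be in the range 0..3")


def cmmd_ring_base_offset(ring):
    """Buffer RAM byte offset of CMMD_RING_BASE_<ring> (Table C-46)."""
    _check_ring(ring)
    return 0x10 * ring


def rslt_ring_base_offset(ring):
    """Buffer RAM byte offset of RSLT_RING_BASE_<ring> (Table C-46)."""
    _check_ring(ring)
    return 0x10 * ring + 0x08


def ring_span_bytes(num_descriptors, stride=DESCRIPTOR_BYTES):
    """Bytes from the start of the first descriptor to the end of the last.

    `stride` is the distance in bytes between descriptor start locations
    (32 when descriptors are adjacent).
    """
    if not 1 <= num_descriptors <= 65536:
        raise ValueError("a ring holds 1..65536 descriptors")
    if stride < DESCRIPTOR_BYTES:
        raise ValueError("stride cannot be smaller than one descriptor")
    return (num_descriptors - 1) * stride + DESCRIPTOR_BYTES


def placement_ok(cmd_base, rslt_base, num_descriptors, stride=DESCRIPTOR_BYTES):
    """True if the rings are co-located (same base) or do not overlap.

    Assumption: overlap is judged on whole ring extents, and base
    addresses must fit in the 32-bit base address control words.
    """
    span = ring_span_bytes(num_descriptors, stride)
    for base in (cmd_base, rslt_base):
        if not 0 <= base <= 0xFFFFFFFF:
            return False
    if cmd_base == rslt_base:
        return True  # co-located rings share one base address
    return cmd_base + span <= rslt_base or rslt_base + span <= cmd_base
```

For example, with four adjacent descriptors per ring (128 bytes), a command ring at 0x1000 and a result ring at 0x1080 pass the check, while a result ring at 0x1040 is rejected because it starts inside the command ring.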
Result Ring Base Address Control Words (RSLT_RING_BASE_0 … _3)

RSLT_RING_BASE_0 (Read/Write), 18-bit Address in Host Target Window: 0x00008
RSLT_RING_BASE_1 (Read/Write), 18-bit Address in Host Target Window: 0x00018
RSLT_RING_BASE_2 (Read/Write), 18-bit Address in Host Target Window: 0x00028
RSLT_RING_BASE_3 (Read/Write), 18-bit Address in Host Target Window: 0x00038

Table C-48. Result Ring Base Address Control Words Bit Descriptions

Bits    Name                      Type  Function
[31:0]  Result ring base address  R/W   This is the base address of one result ring in Host address space. For performance reasons, it is suggested to align the base address to an 8-byte boundary, but this is not an absolute requirement. A result ring can be co-located with the accompanying command ring, in which case their base addresses must be identical. When not co-located, the two rings must not overlap.

RING_SIZE_x

RING_SIZE_0 (Read/Write), 18-bit Address in Host Target Window: 0x00040
RING_SIZE_1 (Read/Write), 18-bit Address in Host Target Window: 0x00044
RING_SIZE_2 (Read/Write), 18-bit Address in Host Target Window: 0x00048
RING_SIZE_3 (Read/Write), 18-bit Address in Host Target Window: 0x0004C

Table C-49. Ring Size and Descriptor Offset Control Words Bit Descriptions

Bits     Name               Type  Function
[31:16]  Descriptor offset  R/W   This field specifies the offset in bytes between the starting locations of command descriptors, in the range 33 … 65535. Value 0 indicates that the descriptors are adjacent (with an actual offset of 32 bytes). In that case, reading command descriptors is optimized to read more than one in a single DMA action. Values 1 … 32 are reserved and should not be used. The accompanying result ring will have the same (result) descriptor offset.
[15:0]   Ring size          R/W   This field specifies the size of a command ring in number of descriptors, minus 1. Minimum value is 0 (for 1 descriptor); maximum value is 65535 (for 64K descriptors). The accompanying result ring will have the same size.

RING_RW_PTRS_x

RING_RW_PTRS_0 (Read/Write), 18-bit Address in Host Target Window: 0x00050
RING_RW_PTRS_1 (Read/Write), 18-bit Address in Host Target Window: 0x00054
RING_RW_PTRS_2 (Read/Write), 18-bit Address in Host Target Window: 0x00058
RING_RW_PTRS_3 (Read/Write), 18-bit Address in Host Target Window: 0x0005C

Table C-50. Ring Read/Write Pointers Bit Descriptions

Bits     Name                       Type  Function
[31:16]  Result ring write pointer  R/W   This field indicates the entry number in the result ring that will be written next by the PKA. It is reset to zero after starting up and is updated after every result descriptor write DMA operation. Pointers wrap around; the maximum value of this field equals the value of the ‘Ring size’ field of the corresponding RING_SIZE_x control word.
[15:0]   Command ring read pointer  R/W   This field indicates the entry number in the command ring that will be read next by the PKA. It is reset to zero after starting up and is updated after every command descriptor read DMA operation. Pointers wrap around; the maximum value of this field equals the value of the ‘Ring size’ field of the corresponding RING_SIZE_x control word.

PKA_RING_OPTIONS

PKA_RING_OPTIONS (Read/Write), 18-bit Address in Host Target Window: 0x00070

Table C-51. PKA Ring Options Control Word Bit Descriptions

Bits     Name                 Type  Function
[31:24]  ‘Signature’ byte     R/W   This byte must contain 0x46. It is used because these options are transferred through RAM, which does not have a defined reset value. The PKA master controller keeps reading this word at start-up until the ‘Signature’ byte contains 0x46 and the ‘Reserved’ field contains zero.
[23:9]   Reserved                   Bits MUST be written with a 0 and ignored on a read.
[8]      Zero KDKs            R/W   If this bit is ‘1’, the PKI Key Decryption Keys (KDK) storage areas and associated control words will be zeroed by internal FW during the boot-up procedure. This will indicate all KDKs as being invalid. If this bit is ‘0’, it is assumed that the KDK storage and control words have already been set up in secure RAM and they will be left intact during boot-up. Note that this bit is (functionally) forced to ‘1’ during a High Assurance mode boot-up, as the KDK area is initially used to hold ‘farm’ engine firmware in that case.
[7:4]    Ring X in-order      R/W   These bits indicate whether a result ring delivers results strictly in-order (‘1’) or whether result descriptors are written to the result ring as soon as they become available, that is, out-of-order (‘0’). Bit [4] controls ring 0, bit [5] ring 1, bit [6] ring 2, and bit [7] ring 3. With out-of-order delivery, it is important that a driver tags each command descriptor with a number to be able to determine the command to which a result belongs.
[3:2]    Ring enable control  R/W   This field specifies how many rings will be used: ‘00’ = ring 0 only, ‘01’ = rings 0 and 1, ‘10’ = rings 0, 1 and 2, ‘11’ = all four rings.
[1:0]    Ring prio control    R/W   Ring priority control. This field specifies the ring priorities: ‘00’ = full rotating priority, ‘01’ = fixed priority (ring 0 lowest), ‘10’ = ring 0 has the highest priority (the remaining rings have rotating priority), ‘11’ = reserved, do not use.

To change ring options, the complete PKA must be cycled through reset.

MASTER_FW_VERSION

MASTER_FW_VERSION (Read/Write), 18-bit Address in Host Target Window: 0x00074

Table C-52. Master Firmware Version Information

Bits     Name                            Type  Function
[31:24]  Reserved                        R/W   Bits MUST be written with a 0 and ignored on a read.
[23:16]  Master FW major version number  R/W   Indicates the major version number of this master firmware release. The first full release will have version 1.0, so the value of this field will be 0x01.
[15:8]   Master FW minor version number  R/W   Indicates the minor version number of this master firmware release. The first full release will have version 1.0, so the value of this field will be 0x00.
[7:0]    Master FW patch level           R/W   Indicates the ‘patch level’ of this master firmware release; will be 0 to start with.
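As an illustration of the control word layouts in Tables C-49, C-51, and C-52, the following Python sketch packs RING_SIZE_x and PKA_RING_OPTIONS values and unpacks MASTER_FW_VERSION. The function names are hypothetical and are not part of any Tilera API. The mapping of the in-order bits (bit 4 for ring 0 up to bit 7 for ring 3) is inferred from the register bit diagram and should be treated as an assumption.

```python
SIGNATURE_BYTE = 0x46  # required value of PKA_RING_OPTIONS bits [31:24]


def pack_ring_size(num_descriptors, descriptor_offset=0):
    """Build a RING_SIZE_x word (Table C-49).

    descriptor_offset 0 means adjacent 32-byte descriptors; values
    1..32 are reserved, so they are rejected here.
    """
    if not 1 <= num_descriptors <= 65536:
        raise ValueError("a ring holds 1..65536 descriptors")
    if descriptor_offset != 0 and not 33 <= descriptor_offset <= 65535:
        raise ValueError("descriptor offset must be 0 or 33..65535")
    return (descriptor_offset << 16) | (num_descriptors - 1)


def pack_ring_options(num_rings, priority,
                      in_order=(False, False, False, False),
                      zero_kdks=False):
    """Build a PKA_RING_OPTIONS word (Table C-51).

    priority: 0 = full rotating, 1 = fixed (ring 0 lowest),
              2 = ring 0 highest; 3 is reserved.
    in_order: one flag per ring; bit (4 + ring) is an assumed mapping.
    """
    if not 1 <= num_rings <= 4:
        raise ValueError("num_rings must be 1..4")
    if priority not in (0, 1, 2):
        raise ValueError("priority must be 0, 1 or 2 (3 is reserved)")
    if len(in_order) != 4:
        raise ValueError("in_order needs one flag per ring")
    word = SIGNATURE_BYTE << 24          # bits [31:24]
    if zero_kdks:
        word |= 1 << 8                   # bit [8]; bits [23:9] stay zero
    for ring, flag in enumerate(in_order):
        if flag:
            word |= 1 << (4 + ring)      # bits [7:4]
    word |= (num_rings - 1) << 2         # bits [3:2]
    word |= priority                     # bits [1:0]
    return word


def decode_fw_version(word):
    """Return (major, minor, patch) from MASTER_FW_VERSION (Table C-52)."""
    return ((word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF)
```

Under these assumptions, enabling all four rings with full rotating priority and out-of-order results gives the word 0x4600000C, and a version word of 0x00010000 decodes to (1, 0, 0), the first full release described in Table C-52.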
Note: Although indicated as ‘Read/Write’, this memory location is meant to be handled as read-only.

Note: The first two instructions (words) of the main master firmware image also contain this information. The first word holds the patch level in bits [7:0] and the minor FW version number in bits [15:8], and the second word holds the major FW version number in bits [7:0].

C.5.5 PKI Command and Result Descriptors

C.5.5.1 Command Descriptor Contents

Command descriptors are 32 bytes (8 words of 32 bits) long. The Host indicates the presence of new command descriptors in a ring by incrementing the command counter associated with that ring (after the descriptor contents have been written). It is possible to ‘link’ command descriptors so that one command can only be executed when the previous linked command has been executed. These linked descriptors must be transferred into the ring as a whole (that is, the command counter must be incremented by the number of linked commands).

PKI Command Descriptor

The generic layout of a PKI command descriptor is as follows:

PKI Command Descriptor (Read/Write)

Byte Offset  Contents (most significant bit first)
31:28        Linked (bit [31]), Driver status (bits [30:29]), Shift count / odd powers (bits [28:24]), KDK nr. (bits [23:22]), Encrypted vectors bit mask (bits [21:16]), 00000000, Command (bits [7:0])
27:24        00000, Length ‘B’ (bits [26:18]), 0000000, Length ‘A’ (bits [10:2]), 00
23:20        Pointer ‘E’ (bits [31:0])
19:16        ‘Tag’ word for driver use (bits [31:0])
15:12        Pointer ‘D’ (bits [31:0])
11:8         Pointer ‘C’ (bits [31:0])
7:4          Pointer ‘B’ (bits [31:0])
3:0          Pointer ‘A’ (bits [31:0])

Table C-53. PKI Command Descriptor

Field / Description

Pointer ‘A’ … ‘E’
    These words provide up to 5 parameter/result pointers in Host space. The length of the parameters and results is a multiple of 4 bytes, but the start addresses can be at any byte boundary. Pointers are allowed to point to the same memory location.

‘Tag’ word for driver use
    This complete word can be used by a Host driver to hold an identification value or pointer for its own administration. This word will be present (unchanged) in the result descriptor for this command.

Length ‘A’ / ‘B’
    These fields indicate the length of input vectors in 32-bit words. In general, they indicate the lengths of vectors ‘A’ and ‘B’, but their actual use depends on the operation performed.

Command
    This field indicates which command to execute. Commands include (almost) all commands of a standard PKA Engine module as described in “PKI Command/Result Specifics (Firmware Dependent)” on page 337, with higher protocol level commands added. The standard PKA Engine commands do not use pointer ‘E’.

Encrypted vectors bit mask (only applicable for the PKAb and PKAd)
    This field indicates (with a ‘1’) which of the five input parameter vectors contains encrypted data that must be decrypted before use. If a parameter vector contains sub-vectors, these are decrypted separately. Each command has specific rules as to which parameters can be provided in encrypted form. Illegal selection of encrypted vectors results in an error. Bits [19:16] of the last descriptor word control pointers ‘A’ … ‘D’, bit [21] controls pointer ‘E’ (bit [20] should be kept zero).
    NOTE: When encrypted vectors are used, be careful with the crypto mode selection (see “PKI Key Decrypt Key Control Words” on page 368). When using AES-ECB or AES-CBC, the sub-vectors must be multiples of 128 bits long, as these crypto modes only work on full-length AES blocks. When using AES-CFB, AES-OFB or AES-CTR, this restriction does not apply, as the last block processed for these modes does not need to be a full-length AES block.

KDK nr
    This field indicates which of the (up to) four Key Decryption Keys must be used to decrypt the input parameter vectors specified by the ‘Encrypted vectors bit mask’. The value of this field is used directly as the KDK number (in the range 0 … 3). Using an invalid KDK will result in an error.

Shift count / odd powers
    This field is used to convey the number of bits to shift (in the range 0 … 31) for shift left/right basic operations. It is also used to convey the number of odd powers to use for modular exponentiation operations (with and without CRT). Allowed values for odd powers are in the range 1 … 16; selecting a value that is too high results in an error.

Driver status
    These two bits are reserved for use by the Host driver. In a result descriptor, they are forced to zero (as ‘Written zero’ bits). When overlapping the command and result rings, they can be used to indicate the state of a descriptor block (‘empty’, ‘command’ or ‘result’; the last would get fixed code zero). When not used, these bits can be kept zero. They do not influence the PKI command handling.

Linked
    This bit indicates the linked state of the descriptor block as follows:
    0  Normal command descriptor.
    1  Linked command descriptor. The next command in this ring cannot be executed before this one has finished execution.

Note: For linked descriptors, the PKA master controller will scan forward in the ring until it finds the location of the first command descriptor following the linked descriptors (the last one of the linked descriptors is a normal command descriptor). The linked descriptors are handled in the same order as they are placed in the ring.
Normal arbitration between commands placed in rings is resumed with the command following the linked chain (if any). In essence, a chain of linked descriptors is handled as a single command. C.5.5.2 Result Descriptor Contents After finishing a command, the PKA master controller will convert the original command descriptor (as described in Command Descriptor Contents) into a result descriptor and write this descriptor to the result ring. The result status at the end of command execution is returned mostly in empty fields of the command descriptor, except for the sixth word which held pointer E (which is completely modified). 334 Tile Processor and I/O Device Guide for the TILE-Gx Family of Processors Tilera Confidential — Subject to Change Without Notice Public Key Accelerator (PKA) PKI Result Descriptor The generic layout of a PKI result descriptor is as shown in Table C-54 (yellow fields are copied from the command descriptor). PKI Result Descriptor (Read/Write) 3124 Written zero (bits [30:29]) Linked (bit [31]) 31:28 27:24 2316 Shift count / odd powers (bits [28:24]) 158 70 Encrypted vectors Result code (bits [15:8]) bit mask (bits [21:16]) Length ‘B’ (bits [26:18]) 0000000 Command (bits [7:0]) Length ‘A’ (bits [10:2]) 00 CMP result (bits [31:29]) 00 KDK nr. (bits [23:22]) Byte Offset Modulo = 0 ([31]) Modulo MSW offset (bits [28:18]) 000000 Main result MS bit offset (bits [22:18]) 19:16 ‘Tag’ word for driver use (bits [31:0]). 15:12 Pointer ‘D’ (bits [31:0]). 11:8 Pointer ‘C’ (bits [31:0]). 7:4 Pointer ‘B’ (bits [31:0]). 3:0 Pointer ‘A’ (bits [31:0]). 00 00 Main result MSW offset (bits [12:2]) 00 Result = 0 ([15]) 00 23:20 Table C-54. PKI Result Descriptor Field Description Pointer ‘A’ … ‘D’ These words provided up to 4 parameter/result pointers, unchanged. ‘Tag’ word for driver use This word is unchanged from the command descriptor and can be used by a driver to match the result to a given command (for example when out-oforder result delivery is selected). 
Main result MSW offset / Result = 0 These fields are copied almost directly from bits [10:0] respectively bit [15] of the FARM_PKA_MSW_X register of the ‘farm’ engine on which the command was executed. The only change is that the start offset for the main result vector of the command has been subtracted here (the Host need not concern itself with internal ‘farm’ data RAM management). Tile Processor and I/O Device Guide for the TILE-Gx Family of Processors Tilera Confidential — Subject to Change Without Notice 335 Appendix C: Miscellaneous Accelerator Specifications Table C-54. PKI Result Descriptor (continued) Field Description Modulo MSW offset / Modulo = 0 / Main result MS bit offset These fields are copied almost directly from bits [10:0] respectively bit [15] and bits [4:0] of the FARM_PKA_DIVMSW_X register of the ‘farm’ engine on which the command was executed. The only change is that the start offset for the Modulo result vector of the command has been subtracted here (the Host need not concern itself with internal ‘farm’ data RAM management). Note that only the basic Modulo and Divide operations return the Modulo MSW offset and zero indication, all others return the Main result MS bit offset (which will be zero if the Main result is zero). NOTE: For the Compare, ECDSA, DSA and AES known answer commands the ‘Main result MSW offset/Result = 0’ and ‘Modulo MSW offset/Modulo = 0/ Main result MS bit offset’ information must be ignored. Length ‘A’ / ‘B’ These fields were used to indicate the length of input vectors in 32 bit words, unchanged. CMP result bits These bits are only updated for a basic ‘Compare’ command and reflect the state of bits [2:0] of the FARM_PKA_COMPARE_X register of the ‘farm’ engine on which the command was run. Command This field indicates which command was executed, unchanged. Result code This field indicates the global result after the operation. Value 0x00 indicates that no errors were encountered. 
    Other values reflect a warning or error; refer to Table C-55 for a complete overview of result codes.

Encrypted vectors bit mask
    This field indicates which of the five input parameter vectors were provided in encrypted form, unchanged.

KDK nr
    This field indicates which KDK had to be used to decrypt encrypted parameter vectors, unchanged.

Shift count / odd powers
    This field was used to convey the number of bits to shift for shift left/right basic operations or the number of odd powers to use for modular exponentiation operations, unchanged.

Written zero
    These bits are forced to zero when writing out a result descriptor. They can be used in conjunction with the ‘Driver status’ bits in the command descriptor to indicate the state of a descriptor block in a combined command/result ring. For a separate result ring, these bits can be used to determine that the writing of a result descriptor to Host space has been completed, provided that the driver sets these bits non-zero for empty result descriptor blocks.

Linked
    This field indicates the ‘linked’ state of the descriptor block as follows (unchanged):
    0 = Normal result descriptor.
    1 = Linked result descriptor. This result descriptor does not contain the final result of a linked chain of command descriptors.

Note: Linked command descriptors deliver linked result descriptors one by one; it is up to the Host to ignore or use the intermediate results. If a linked command returns an error (that is, Result code bit [15] is 1), the error code is propagated to the last (non-linked) result descriptor, and linked commands following the failing command are not executed.

Note: If a descriptor ring has in-order result delivery selected, the result counter is only incremented for consecutive completed commands. This can mean that the result counter increments very rapidly once a very time-consuming command, which stalled other already completed commands, has itself been completed.
With out-of-order result delivery, the result counter is incremented immediately after a command has been completed and the result descriptor has been written to Host space. This makes the driver more complex, because it must use the Tag word to keep track of commands and results, but it can increase overall system performance.

Note: The Written zero field in the last word of the result descriptor is used to prevent interrupt race errors. These can happen when the descriptor available interrupt reaches the Host processor before the descriptor data has been written across the Host bus into the external memory. The interrupt handler should poll the Written zero field to check that the data has indeed arrived before relying on other information in the result descriptor.

C.5.5.3 PKI Command/Result Specifics (Firmware Dependent)

This section gives a point-by-point description of the various operations available within the PKA. All inputs and outputs must be considered as unsigned integers.

CAUTION: Unless otherwise indicated, parameter vectors must not be input in encrypted form. This is to protect the stored Key Decryption Keys against attacks.

Table C-55 lists the result codes that are currently defined. The highest bit of the result code byte (bit [15] of the last word in the result descriptor) indicates whether an error occurred (1) or not (0). If an error occurred, result vectors are not written and only the result code in the result descriptor conveys meaningful information.

Table C-55.
PKI Result Code Values

Code      Description
0x00      No error
0x81      Modulus was even
0x02      Exponent was 0, result returned as value 1 (for a modular exponentiation)
0x83      Modulus was too short (less than 33 significant bits)
0x04      Exponent was 1, result returns input value (for a modular exponentiation)
0x85      Odd powers not in range 1 … 16
0x86      Result point of ECC operation is ‘at infinity’ – not a real error!
0x87      Unknown command
0x88      Illegal encrypted parameter use
0x89      Operand length error
0x8A      Farm memory too small for operation
0x8B      Modular inverse does not exist
0x8C      Operand value error
0x8D      (Intermediate) Result value error
0xC0      Memory deadlock error
Others    Reserved

Command List

Table C-56 is the list of commands or functions. For information about command restrictions, see “Restrictions on Input Vectors for PKCP Operations” on page 353.

Table C-56. Master Command List

Command (Description)    On page . . .
Add (Basic Arithmetic)    page 339
Subtract (Basic Arithmetic)    page 339
Add/Subtract Combination (Basic Arithmetic)    page 340
Multiply (Basic Arithmetic)    page 340
Divide (Basic Arithmetic)    page 341
Modulo (Basic Arithmetic)    page 341
Shift Left (Basic Arithmetic)    page 342
Shift Right (Basic Arithmetic)    page 342
Compare (Basic Arithmetic)    page 343
Copy (Basic Arithmetic)    page 343
Modular Exponentiation without CRT (Complex Arithmetic)    page 344
Modular Exponentiation with CRT (Complex Arithmetic)    page 345
Modular Inversion (Complex Arithmetic)    page 346
ECC Point Addition/Doubling (Complex Arithmetic)    page 346
ECC Point Multiplication (Complex Arithmetic)    page 347
ECDSA Signature Generation (High-Level PKA Operations)    page 348
ECDSA Signature Verification (High-Level PKA Operations)    page 349
DSA Signature Generation (High-Level PKA Operations)    page 350
DSA Signature Verification (High-Level PKA Operations)    page 351
AES Known Answer (Verify Operations)    page 352

Add (Basic Arithmetic)

Command Code: 0x01
Operation: A + B -> C
Inputs: ‘A’ (length ‘A’, ≤ 130 words long), ‘B’ (length ‘B’, ≤ 130 words long)
Result: ‘C’ (max (length ‘A’, length ‘B’) + 1 word long)
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.
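The ‘Possible Errors’ listed for each command are reported through the result codes of Table C-55. As a minimal sketch of how a driver might decode them, the following Python fragment splits the last word of a result descriptor into its Command and Result code fields and tests the error bit (bit [15] of the word). The function name and the returned tuple are illustrative assumptions, not part of any Tilera API; the code values themselves are taken from Table C-55.

```python
# Result codes as listed in Table C-55 (descriptions abbreviated).
RESULT_CODES = {
    0x00: "No error",
    0x81: "Modulus was even",
    0x02: "Exponent was 0, result returned as value 1",
    0x83: "Modulus was too short",
    0x04: "Exponent was 1, result returns input value",
    0x85: "Odd powers not in range 1 ... 16",
    0x86: "Result point of ECC operation is at infinity",
    0x87: "Unknown command",
    0x88: "Illegal encrypted parameter use",
    0x89: "Operand length error",
    0x8A: "Farm memory too small for operation",
    0x8B: "Modular inverse does not exist",
    0x8C: "Operand value error",
    0x8D: "(Intermediate) Result value error",
    0xC0: "Memory deadlock error",
}


def decode_last_word(last_word):
    """Hypothetical helper: decode the last 32-bit word of a result descriptor.

    Returns (command, result_code, is_error, description).
    """
    command = last_word & 0xFF             # Command, bits [7:0]
    result_code = (last_word >> 8) & 0xFF  # Result code, bits [15:8]
    is_error = bool(last_word & (1 << 15))  # highest bit of the result code
    description = RESULT_CODES.get(result_code, "Reserved")
    return command, result_code, is_error, description
```

Note that code 0x86 has the error bit set although Table C-55 describes it as not a real error, so a driver would probably treat it separately from the other error codes.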

Subtract (Basic Arithmetic)

Command Code: 0x02
Operation: A − B -> C
Inputs: ‘A’ (length ‘A’, ≤ 130 words long), ‘B’ (length ‘B’, ≤ 130 words long)
Result: ‘C’ (max (length ‘A’, length ‘B’) long)
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.

Add/Subtract Combination (Basic Arithmetic)

Command Code: 0x03
Operation: A + C − B -> D
Inputs: ‘A’ (length ‘A’, ≤ 130 words long), ‘B’ (length ‘A’, ≤ 130 words long), ‘C’ (length ‘A’, ≤ 130 words long)
Result: ‘D’ (length ‘A’ + 1 word long)
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset

Multiply (Basic Arithmetic)

Command Code: 0x04
Operation: A × B -> C
Inputs: ‘A’ (length ‘A’, ≤ 130 words long), ‘B’ (length ‘B’, ≤ 130 words long)
Result: ‘C’ (length ‘A’ + length ‘B’ long)
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.
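The result lengths stated for the Add, Subtract, Add/Subtract Combination and Multiply commands can be summarized in a small helper. This is only a sketch: the function name is hypothetical, the lower bound of one word per operand is an assumption (the command tables only state the 130-word upper limit), and the values are the lengths quoted in the command tables above, not the PKCP-level allocation sizes.

```python
MAX_WORDS = 130  # upper operand limit quoted in the command tables


def basic_result_words(command_code, len_a, len_b=0):
    """Hypothetical helper: result length in 32-bit words for codes 0x01..0x04.

    For command 0x03 the 'B' and 'C' operands use length 'A', so len_b is ignored.
    """
    lengths = (len_a,) if command_code == 0x03 else (len_a, len_b)
    # Assumption: every operand is at least one word long.
    if any(not 1 <= n <= MAX_WORDS for n in lengths):
        raise ValueError("operand length error")
    if command_code == 0x01:   # Add: A + B -> C
        return max(len_a, len_b) + 1
    if command_code == 0x02:   # Subtract: A - B -> C
        return max(len_a, len_b)
    if command_code == 0x03:   # Add/Subtract Combination: A + C - B -> D
        return len_a + 1
    if command_code == 0x04:   # Multiply: A x B -> C
        return len_a + len_b
    raise ValueError("not an add, subtract or multiply command")
```

For example, `basic_result_words(0x04, 4, 6)` gives the 10-word product length of a 4-word by 6-word multiplication.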

Divide (Basic Arithmetic)

Command Code: 0x05
Operation: A mod B -> C, A div B -> D
Inputs: ‘A’ (length ‘A’, must be ≥ length ‘B’, ≤ 130 words long), ‘B’ (length ‘B’, must be > 1, ≤ 130 words long)
Result: ‘C’ (length ‘B’ long), ‘D’ (length ‘A’ – length ‘B’ + 1 word long)
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
    • Modulus was too short
Extra Status: Result (‘D’) = 0, Main result (‘D’) MSW offset, Modulo (‘C’) = 0, Modulo (‘C’) MSW offset

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.

Modulo (Basic Arithmetic)

Command Code: 0x06
Operation: A mod B -> C
Inputs: ‘A’ (length ‘A’, must be ≥ length ‘B’, ≤ 130 words long), ‘B’ (length ‘B’, must be > 1, ≤ 130 words long)
Result: ‘C’ (length ‘B’ long)
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
    • Modulus was too short
Extra Status: Modulo (‘C’) = 0, Modulo (‘C’) MSW offset

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.
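As an illustration of the Divide and Modulo commands, the sketch below models them with Python integers. It is a software reference model under stated assumptions, not a description of the hardware: operands are passed as integers with their word lengths, the length checks follow the command tables above, and the non-zero most significant word of ‘B’ follows the restriction table for PKCP operations. All function names are hypothetical.

```python
def divide_model(a, len_a, b, len_b):
    """Hypothetical model of Divide: returns (remainder C, quotient D, C words, D words)."""
    # Length rules from the Divide/Modulo command tables.
    if not (1 < len_b <= len_a <= 130):
        raise ValueError("operand length error")
    # Operands must fit in the stated number of 32-bit words.
    if a >> (32 * len_a) or b >> (32 * len_b):
        raise ValueError("operand does not fit its stated length")
    # The most significant 32-bit word of B must not be zero.
    if b >> (32 * (len_b - 1)) == 0:
        raise ValueError("modulus was too short")
    quotient, remainder = divmod(a, b)
    c_words = len_b              # 'C' is length 'B' long
    d_words = len_a - len_b + 1  # 'D' is length 'A' - length 'B' + 1 words long
    return remainder, quotient, c_words, d_words


def modulo_model(a, len_a, b, len_b):
    """Hypothetical model of Modulo: returns (remainder C, C words)."""
    remainder, _, c_words, _ = divide_model(a, len_a, b, len_b)
    return remainder, c_words
```

Because ‘B’ has a non-zero top word, the quotient always fits in length ‘A’ – length ‘B’ + 1 words, which matches the stated size of ‘D’.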

Shift Left (Basic Arithmetic)

Command Code: 0x07
Operation: A shl “shift count” -> C
Inputs: ‘A’ (length ‘A’, ≤ 130 words long), shift count (range 0 … 31 bits)
Result: ‘C’ (if “shift count” = 0: length ‘A’ long, else length ‘A’ + 1 word long)
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.

Shift Right (Basic Arithmetic)

Command Code: 0x08
Operation: A shr “shift count” -> C
Inputs: ‘A’ (length ‘A’, ≤ 130 words long), shift count (range 0 … 31 bits)
Result: ‘C’ (length ‘A’ long)
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.
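The result lengths of the two shift commands follow from the shift count being limited to 31 bits: a left shift can spill into at most one extra word, and a right shift never grows the vector. The following sketch models this with Python integers; the function names are hypothetical and the model says nothing about how the hardware stores the vectors.

```python
def shift_left_model(a, len_a, shift):
    """Hypothetical model of Shift Left: returns (value, result length in words)."""
    if not 0 <= shift <= 31:
        raise ValueError("shift count must be in range 0 ... 31")
    if a >> (32 * len_a):
        raise ValueError("operand does not fit its stated length")
    # Result is length 'A' for a zero shift, else length 'A' + 1 word.
    result_words = len_a if shift == 0 else len_a + 1
    return a << shift, result_words


def shift_right_model(a, len_a, shift):
    """Hypothetical model of Shift Right: returns (value, result length in words)."""
    if not 0 <= shift <= 31:
        raise ValueError("shift count must be in range 0 ... 31")
    if a >> (32 * len_a):
        raise ValueError("operand does not fit its stated length")
    return a >> shift, len_a
```

A one-word operand of 0xFFFFFFFF shifted left by 31 bits occupies 63 bits, which is why the result vector is two words long in that case.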

Compare (Basic Arithmetic)

Command Code: 0x09
Operation: Compare values of ‘A’ and ‘B’
Inputs: ‘A’ (length ‘A’, ≤ 130 words long), ‘B’ (length ‘A’, ≤ 130 words long)
Result: ‘A’ = ‘B’ (Compare result = 001), ‘A’ < ‘B’ (Compare result = 010), ‘A’ > ‘B’ (Compare result = 100)
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
Extra Status: N/A

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.

Copy (Basic Arithmetic)

Command Code: 0x0A
Operation: A -> C
Inputs: ‘A’ (length ‘A’, ≤ 255 words long)
Result: ‘C’ (length ‘A’ long)
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.

Modular Exponentiation without CRT (Complex Arithmetic)

Command Code: 0x10
Operation: C^A mod B -> D
Inputs: ‘A’ (length ‘A’, which must be in range 1 … 130 words), can be an encrypted vector, ‘B’ (length ‘B’, which must be in range 2 … 130 words), can be an encrypted vector, ‘C’ (length ‘B’). Number of ‘odd powers’ (in the range 1 … 16).
Result: ‘D’ (length ‘B’ + 1 word long)
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
    • Modulus is even
    • Modulus too short
    • Exponent was 0
    • Exponent was 1
    • Odd powers out-of-range
    • Farm memory too small (probably due to too high odd powers setting)
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.

Modular Exponentiation with CRT (Complex Arithmetic)

Command Code: 0x11
Operation:
    (Input mod Mod_P)^Exp_P mod Mod_P -> X,
    (Input mod Mod_Q)^Exp_Q mod Mod_Q -> Y,
    ((((X – Y) mod Mod_P) × Q_inv) mod Mod_P) × Mod_Q -> Z,
    Y + Z -> D
Inputs: ‘A’ points to Exp_P followed by Exp_Q with possibly one buffer word (note a) in between (both length ‘A’, which must be in range 1 … 66 words), can be an encrypted vector, ‘B’ points to Mod_P followed by Mod_Q with one or two buffer words (note b) in between (both length ‘B’, which must be in range 2 … 66 words), can be an encrypted vector, ‘C’ points to Q_inv (length ‘B’), can be an encrypted vector, ‘E’ points to Input (2 × length ‘B’ long). The number of ‘odd powers’ (in the range 1 … 16).
Result: ‘D’ (2 × length ‘B’ long)
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
    • Modulus is even
    • Modulus too short
    • Exponent was 0
    • Exponent was 1
    • Odd powers out-of-range
    • Farm memory too small (probably due to too high odd powers setting)
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset

a. The buffer word is inserted when length ‘A’ is odd.
b. One buffer word when length ‘B’ is odd, two buffer words when length ‘B’ is even.
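
The four steps of the CRT operation can be checked against a plain modular exponentiation in software. The sketch below is a reference model using Python integers, with parameter names borrowed from the operation description; the function itself is hypothetical. It assumes Q_inv is the inverse of Mod_Q modulo Mod_P, which is what makes the recombination step work; the text above does not define Q_inv explicitly.

```python
def modexp_crt_model(inp, exp_p, exp_q, mod_p, mod_q, q_inv):
    """Hypothetical reference model of the four CRT steps."""
    x = pow(inp % mod_p, exp_p, mod_p)  # (Input mod Mod_P)^Exp_P mod Mod_P -> X
    y = pow(inp % mod_q, exp_q, mod_q)  # (Input mod Mod_Q)^Exp_Q mod Mod_Q -> Y
    # ((((X - Y) mod Mod_P) x Q_inv) mod Mod_P) x Mod_Q -> Z
    z = ((((x - y) % mod_p) * q_inv) % mod_p) * mod_q
    return y + z                        # Y + Z -> D


# Toy RSA-style example (far below the minimum modulus size of the hardware).
p, q, e = 61, 53, 17
d = pow(e, -1, (p - 1) * (q - 1))  # private exponent (needs Python 3.8+)
cipher = pow(65, e, p * q)
plain = modexp_crt_model(cipher, d % (p - 1), d % (q - 1), p, q, pow(q, -1, p))
```

In this example `plain` equals the value obtained from `pow(cipher, d, p * q)`, while each of the two exponentiations works on operands of half the size.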

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.

Modular Inversion (Complex Arithmetic)

Command Code: 0x12
Operation: A^–1 mod B -> D
Inputs: ‘A’ (length ‘A’, ≤ 130 words long), ‘B’ (length ‘B’, ≤ 130 words long)
Result: ‘D’ (length ‘B’ long)
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
    • Modulus is even
    • No inverse exists (GCD (A, B) ≠ 1)
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.

ECC Point Addition/Doubling (Complex Arithmetic)

Command Code: 0x14
Operation: Point addition (Pnt_A ≠ Pnt_C) or point doubling (Pnt_A = Pnt_C) on elliptic curve y^2 = x^3 + ax + b (mod p), Pnt_A + Pnt_C -> Pnt_D
Inputs: ‘A’ points to Pnt_A.x followed by Pnt_A.y with two or three buffer words (note a) in between (both length ‘B’, which must be in range 2 … 24 words), ‘B’ points to modulus p followed by curve parameter a with two or three buffer words (note a) in between (both length ‘B’, curve parameter b is not used here), can be an encrypted vector, ‘C’ points to Pnt_C.x followed by Pnt_C.y with two or three buffer words (note a) in between (both length ‘B’), can be an encrypted vector
Result: ‘D’ points to Pnt_D.x followed by Pnt_D.y with two or three buffer words (note a) in between (both length ‘B’)
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
    • Modulus is even
    • Result point at infinity
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset. Note that this information refers to Pnt_D.x only.

a. Two buffer words when length ‘B’ is even, three buffer words when length ‘B’ is odd.

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.

ECC Point Multiplication (Complex Arithmetic)

Command Code: 0x15
Operation: Point multiplication on elliptic curve y^2 = x^3 + ax + b (mod p), k × Pnt_C -> Pnt_D
Inputs: ‘A’ points to scalar multiplication value k (length ‘A’, must be in range 1 … 24 words), can be an encrypted vector. ‘B’ points to modulus p followed by curve parameters a and b with two or three buffer words (note a) in between (all of length ‘B’, which must be in range 2 … 24 words), can be an encrypted vector. ‘C’ points to Pnt_C.x followed by Pnt_C.y with two or three buffer words in between (both length ‘B’), can be an encrypted vector.
Result: ‘D’ points to Pnt_D.x followed by Pnt_D.y with two or three buffer words in between (both length ‘B’).
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
    • Modulus is even
    • Modulus too short
    • Result point at infinity
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset. Note that this information refers to Pnt_D.x only.

a. Two buffer words when length ‘B’ is even, three buffer words when length ‘B’ is odd.

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.
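The ECC commands pack several sub-vectors behind one pointer, separated by two or three buffer words depending on whether the sub-vector length is even or odd. A driver therefore needs the word offset of each sub-vector within the block. The sketch below computes these offsets; the function names are hypothetical, and it assumes that the number of buffer words after a sub-vector is determined by that sub-vector's own length, which is exact when all sub-vectors share length ‘B’.

```python
def ecc_buffer_words(length):
    """Two buffer words after an even-length sub-vector, three after an odd one."""
    return 2 if length % 2 == 0 else 3


def subvector_offsets(lengths):
    """Hypothetical helper: start offsets (in 32-bit words) of packed sub-vectors.

    `lengths` lists the sub-vector lengths in order, for example [8, 8, 8]
    for modulus p and curve parameters a and b of a 256-bit curve.
    """
    offsets = []
    position = 0
    for length in lengths:
        offsets.append(position)
        position += length + ecc_buffer_words(length)
    return offsets
```

With this rule every sub-vector starts at an even word offset, because an even length plus two and an odd length plus three are both even. This rule applies to the ECC commands only; the buffer word rule for modular exponentiation with CRT is different.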

ECDSA Signature Generation (High-Level PKA Operations)

Command Code: 0x20
Operation: Generate r and s values of an ECDSA signature using elliptic curve y^2 = x^3 + ax + b (mod p, subgroup size n), base point Pnt_C, random value k, private key Alpha and message digest h:
    1. Check that k is in range 1 … n–1 (if not, return an operand value error).
    2. Calculate r = x1 (mod n), where (x1,y1) = k·Pnt_C (an ECC point multiply, note a).
    3. If r equals zero, return with a result value error (must re-try with different k).
    4. Calculate s = k^−1·(h + r·Alpha) (mod n) (k^−1 is a modular inversion).
    5. If s equals zero, return with a result value error (must re-try with different k).
Inputs: ‘A’ points to private key Alpha (length ‘B’, which must be in range 2 … 24), can be an encrypted vector. ‘B’ points to modulus p followed by curve parameters a, b, n and base point coordinates Pnt_C.x followed by Pnt_C.y (all of length ‘B’), with two or three buffer words (note b) between all sub-vectors, can be an encrypted vector. ‘C’ points to message digest h (length ‘B’). ‘E’ points to random value k (length ‘B’).
Result: ‘D’ points to r followed by s with two or three buffer words in between (both length ‘B’). Note that r and s must be non-zero; the algorithm must be re-run with a new k value if either of them ends up being zero (indicated with a result value error).
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
    • Operand value error
    • Modulus is even
    • Modulus too short
    • No inverse exists
    • Result point at infinity
    • Result value error
Extra Status: N/A

a. Only the x-coordinate of the result is used, so the y-coordinate is not calculated.
b. Two buffer words when length ‘B’ is even, three buffer words when length ‘B’ is odd.
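
The five signature generation steps can be followed in a small software model. The sketch below uses Python integers and textbook affine point arithmetic on a toy curve; it only illustrates the arithmetic of the steps and makes no claim about how the PKA firmware implements them. All function names are hypothetical, and the model computes the full point although only the x-coordinate is needed.

```python
def ec_add(p1, p2, a, p):
    """Affine point addition/doubling on y^2 = x^3 + a*x + b (mod p); None = infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % p == 0:
        return None  # P + (-P), or doubling a point with y = 0
    if p1 == p2:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p


def ec_mul(k, point, a, p):
    """Double-and-add scalar multiplication k x point."""
    result = None
    while k:
        if k & 1:
            result = ec_add(result, point, a, p)
        point = ec_add(point, point, a, p)
        k >>= 1
    return result


def ecdsa_sign_model(p, a, n, base, alpha, h, k):
    """Hypothetical model of steps 1 to 5; raises ValueError where an error is returned."""
    if not 1 <= k <= n - 1:                      # step 1
        raise ValueError("operand value error")
    point = ec_mul(k, base, a, p)                # step 2
    if point is None:
        raise ValueError("result point at infinity")
    r = point[0] % n
    if r == 0:                                   # step 3
        raise ValueError("result value error")
    s = pow(k, -1, n) * (h + r * alpha) % n      # step 4
    if s == 0:                                   # step 5
        raise ValueError("result value error")
    return r, s
```

As a usage example, on the toy curve y^2 = x^3 + 2x + 2 (mod 17) with base point (5, 1) of order 19, `ecdsa_sign_model(17, 2, 19, (5, 1), 7, 5, 10)` produces a signature that satisfies the verification equation u1·Pnt_C + u2·Pnt_A with Pnt_A = 7·Pnt_C.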

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.

ECDSA Signature Verification (High-Level PKA Operations)

Command Code: 0x21 (with r” write-back), 0x25 (without r” write-back)
Operation: Generate r” value (and r” = r’ indication) of an ECDSA verify using elliptic curve y^2 = x^3 + ax + b (mod p, subgroup size n), base point Pnt_C, public key point Pnt_A, message digest h and ECDSA signature values r’ and s’.
    1. If r’ or s’ are outside the range 1 … n–1, return with an operand value error.
    2. Calculate w = s’^−1 (mod n) (s’^−1 is a modular inversion).
    3. Calculate u1 = h·w (mod n) and u2 = r’·w (mod n).
    4. Calculate (x1,y1) = u1·Pnt_C + u2·Pnt_A (two point multiplies and one point add).
    5. Calculate r” = x1 (mod n) and compare this value to the given r’.
Inputs: ‘A’ points to public key Pnt_A.x followed by Pnt_A.y with two or three buffer words (note a) in between (both length ‘B’, which must be in range 2 … 24). ‘B’ points to modulus p followed by curve parameters a, b, n and base point coordinates Pnt_C.x followed by Pnt_C.y (all of length ‘B’) with two or three buffer words (note a) between all sub-vectors, can be an encrypted vector. ‘C’ points to message digest h (length ‘B’). ‘E’ points to r’ followed by s’ with two or three buffer words (note a) in between (both length ‘B’).
Result: ‘D’ points to r” (length ‘B’), which is returned for external verification of a match only when command code 0x21 is used. r” = r’ (Compare result = 001), r” < r’ (Compare result = 010), r” > r’ (Compare result = 100).
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
    • Operand value error
    • Modulus is even
    • Modulus too short
    • No inverse exists
    • Result point at infinity
Extra Status: N/A

a. Two buffer words when length ‘B’ is even, three buffer words when length ‘B’ is odd.

Note: The ‘E’ pointer to r’ and s’ is not present in the result descriptor. A driver must keep track of the pointer to r’ itself if external comparison with r” is required.

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.

DSA Signature Generation (High-Level PKA Operations)

Command Code: 0x22
Operation: Generate r and s values of a DSA signature using prime p, sub-prime n, value g (g’s multiplicative order modulo p equals n), random value k, private key Alpha and message digest h (the number of significant bits in n should equal the length of h):
    1. Check that k is in range 1 … n–1 (if not, return an operand value error).
    2. Calculate r = ((g^k mod p) mod n).
    3. Calculate s = k^−1·(h + r·Alpha) (mod n) (k^−1 is a modular inversion).
    4. If r or s equals zero, return with a result value error (must be re-run).
Inputs: ‘A’ points to private key Alpha (length ‘B’, which must be in range 2 … length ‘A’ words), can be an encrypted vector. ‘B’ points to prime p followed by value g (both length ‘A’, which must be in range 2 … 130 words), followed by sub-prime n (length ‘B’), with two or three buffer words (note a) between all sub-vectors, can be an encrypted vector. ‘C’ points to message digest h (length ‘B’). ‘E’ points to random value k (length ‘B’). Number of ‘odd powers’ (in the range 1 … 16).
Result: ‘D’ points to r followed by s with two or three buffer words (note b) in between (both length ‘B’). Note that r and s must be non-zero; the algorithm must be re-run with a new k value if either of them ends up being zero (indicated by returning a result value error).
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
    • Operand value error
    • Modulus is even
    • Modulus too short
    • No inverse exists
    • Odd powers out-of-range
    • Result value error
    • Farm memory too small (probably due to too high odd powers setting)
Extra Status: N/A

a. Two buffer words when length ‘A’ is even, three buffer words when length ‘A’ is odd.
b. Two buffer words when length ‘B’ is even, three buffer words when length ‘B’ is odd.

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.

DSA Signature Verification (High-Level PKA Operations)

Command Code: 0x23 (with r” write-back), 0x27 (without r” write-back)
Operation: Generate r” value (and r” = r’ indication) of a DSA verify using prime p, sub-prime n, value g (g’s multiplicative order modulo p equals n), public key y, message digest h (the number of significant bits in n should equal the length of h) and DSA signature values r’ and s’.
    1. If r’ or s’ are outside the range 1 … n–1, return with an operand value error.
    2. Calculate w = s’^−1 (mod n) (s’^−1 is a modular inversion).
    3. Calculate u1 = h·w (mod n) and u2 = r’·w (mod n).
    4. Calculate r” = (((g^u1·y^u2) mod p) mod n) and compare this value to the given r’.
Inputs: ‘A’ points to public key y (length ‘A’, which must be in range 2 … 130 words), ‘B’ points to prime p followed by value g (both length ‘A’), followed by sub-prime n (length ‘B’, which must be in range 2 … length ‘A’ words), with two or three buffer words (note a) between all sub-vectors, can be an encrypted vector, ‘C’ points to message digest h (length ‘B’), ‘E’ points to r’ followed by s’ with two or three buffer words (note b) in between (both length ‘B’). Number of ‘odd powers’ (in the range 1 … 16).
Result: ‘D’ points to r” (length ‘B’), which is returned for external verification of a match only when command code 0x23 is used. r” = r’ (Compare result = 001), r” < r’ (Compare result = 010), r” > r’ (Compare result = 100).
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
    • Operand value error
    • Modulus is even
    • Modulus too short
    • No inverse exists
    • Odd powers out-of-range
    • Farm memory too small (probably due to too high odd powers setting)
Extra Status: N/A

a. Two buffer words when length ‘A’ is even, three buffer words when length ‘A’ is odd.
b. Two buffer words when length ‘B’ is even, three buffer words when length ‘B’ is odd.

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.

Note: The ‘E’ pointer to r’ and s’ is not present in the result descriptor. A driver must keep track of the pointer to r’ itself if external comparison with r” is required.

Diffie-Hellman Key Exchanges

The standard modular-exponentiation based Diffie-Hellman key exchange (DH) uses only basic modular exponentiations: two on each side, one to generate the value sent to the other side from a chosen local secret, and one to calculate the shared secret from the value received from the other side. The modular exponentiation command described in Modular Exponentiation without CRT (Complex Arithmetic) can be used for these operations.

The Elliptic Curve based Diffie-Hellman key exchange (ECDH) is similar in that it uses basic ECC point multiplications only (in its simplest form, multiply the other side’s public key point by the local private key scalar and use the x-coordinate of the result point as the shared secret). The ECC point multiplication command described in ECC Point Multiplication (Complex Arithmetic) can be used for these operations.

AES Known Answer (Verify Operations)

Command Code: 0xE0
Operation: Performs an AES operation with the given IV/Key/control/increment information on the given data.
Inputs: ‘A’ points to the IV/Key/control/increment information; see PKI Key Decrypt Key Management Interface for the exact information layout (length ‘A’, which must be exactly 14 words). ‘B’ points to the data to decrypt or encrypt (length ‘B’, which must be in range 4 … 508 words and always a multiple of 4 words).
Result: ‘C’ points to the decrypted or encrypted data (length ‘B’).
Possible Errors:
    • Illegal encrypted parameter use
    • Operand length error
Extra Status: N/A

For additional information, see “Restrictions on Input Vectors for PKCP Operations” on page 353, “PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector Overlap Restrictions” on page 354.
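
For the Diffie-Hellman exchange described earlier in this section, each side performs two modular exponentiations. The sketch below shows them with Python's built-in `pow`; each call corresponds to one C^A mod B -> D operation, with the base as ‘C’, the exponent as ‘A’ and the modulus as ‘B’. The function names and the toy parameters are illustrative only; real parameters must satisfy the operand restrictions of the modular exponentiation command (for example, an odd modulus of sufficient length).

```python
def dh_public_value(g, secret, p):
    """First exponentiation: value sent to the other side (g^secret mod p)."""
    return pow(g, secret, p)


def dh_shared_secret(peer_value, secret, p):
    """Second exponentiation: shared secret (peer_value^secret mod p)."""
    return pow(peer_value, secret, p)


# Toy parameters for illustration only.
p, g = 23, 5
secret_a, secret_b = 4, 3

value_a = dh_public_value(g, secret_a, p)   # sent from side A to side B
value_b = dh_public_value(g, secret_b, p)   # sent from side B to side A

shared_a = dh_shared_secret(value_b, secret_a, p)
shared_b = dh_shared_secret(value_a, secret_b, p)
```

Both sides arrive at the same value because (g^b)^a and (g^a)^b are equal modulo p.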

C.5.5.4 Restrictions on PKA Operations

Note: Failure to comply with these restrictions will result in an incorrect mathematical result, but not necessarily an error code in the result descriptor.

Restrictions on Input Vectors for PKCP Operations

Table C-57. Operational Restrictions

Function          Requirements
Multiply          0 < A_Len, B_Len < Max_Len
Add               0 < A_Len, B_Len < Max_Len
Subtract          0 < A_Len, B_Len < Max_Len
                  Result must be positive (A > B)
AddSub            0 < A_Len < Max_Len (B and C operands have A_Len as length, B_Len ignored)
                  Result must be positive ((A + C) > B)
Right Shift       0 < A_Len < Max_Len
Left Shift        0 < A_Len < Max_Len
Divide, Modulo    1 < B_Len < A_Len < Max_Len
                  Most significant 32-bit word of B operand cannot be zero
Compare           0 < A_Len < Max_Len (B operand has A_Len as length, B_Len ignored)
Copy              0 < A_Len < Max_Len

PKCP Result Vector Memory Allocation

The host is responsible for allocating a block of contiguous memory in PKA RAM for the result vector(s). Table C-58 indicates how much memory should be allocated for the result vector(s).

Table C-58.
Result Vector Memory Allocation

Function       Result Vector    Result Vector Length (in 32-bit words)
Multiply       C                A_Len + B_Len + 6 (the 6 ‘scratchpad’ words should be discarded)
Add            C                Max(A_Len, B_Len) + 1
Subtract       C                Max(A_Len, B_Len)
AddSub         D                A_Len + 1
Right Shift    C                A_Len
Left Shift     C                A_Len + 1 (when Shift Value is non-zero)
                                A_Len (when Shift Value is zero)
Divide         C                Remainder: B_Len + 1 (one ‘scratchpad’ word should be discarded)
               D                Quotient: A_Len - B_Len + 1
Modulo         C                Remainder: B_Len + 1
Compare        None             Compare updates the PKA_COMPARE register
Copy           C                A_Len

Input vectors for an operation are always allowed to overlap in memory (partially or completely). Table C-59 identifies restrictions for the overlap of output and input vectors of the operations.

PKCP Result Vector / Input Vector Overlap Restrictions

Table C-59. Result Vector / Input Vector Overlap Restrictions

Function                   Result Vector    Restrictions
Multiply                   C                No overlap with A or B vectors allowed.
Add, Subtract              C                May overlap with A and/or B vector, provided the start address of the C vector does not lie above the start address of the vector(s) with which it overlaps.
AddSub                     D                May overlap with A, B and/or C vector, provided the start address of the D vector does not lie above the start address of the vector(s) with which it overlaps.
Right Shift, Left Shift    C                May overlap with A vector, provided the start address of the C vector does not lie above the start address of the A vector.
Divide                     C                No overlap with A, B or D vectors allowed.
                           D                No overlap with A, B or C vectors allowed.
Modulo                     C                No overlap with A or B vectors allowed.
Compare                    None             Compare does not write a result vector.
Copy                       C                Same restrictions as for Right/Left Shift; copying a vector to a lower address is always allowed, even if source and destination overlap.
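
The overlap rules above reduce to two cases: functions whose result vector must not overlap any other vector, and functions whose result vector may overlap an input as long as its start address is not above the start address of that input. A driver could check a planned memory layout with a helper like the following sketch. The function names and the (start, length) tuples in 32-bit words are illustrative assumptions; for Divide, the other result vector is passed along with the inputs.

```python
NO_OVERLAP = {"multiply", "divide", "modulo"}
START_NOT_ABOVE = {"add", "subtract", "addsub", "right_shift", "left_shift", "copy"}


def vectors_overlap(first, second):
    """True if two (start, length) word ranges share at least one word."""
    return first[0] < second[0] + second[1] and second[0] < first[0] + first[1]


def result_vector_ok(function, result, others):
    """Hypothetical check of one result vector against the overlap restrictions.

    `result` and each entry of `others` are (start, length) tuples in words.
    """
    if function not in NO_OVERLAP and function not in START_NOT_ABOVE:
        raise ValueError("function has no result vector or is unknown")
    for other in others:
        if not vectors_overlap(result, other):
            continue
        if function in NO_OVERLAP:
            return False
        # Overlap is allowed only if the result does not start above the other vector.
        if result[0] > other[0]:
            return False
    return True
```

For example, an Add whose result starts at the same address as its ‘A’ operand passes the check, while a Multiply whose result touches either input does not.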
PKCP Operations

Table C-60 lists the arguments and results for each PKCP operation.

Table C-60. Summary of PKCP Vector Operations
Function | Mathematical Operation | Vector A | Vector B | Vector C | Vector D
Multiply | A x B -> C | Multiplicand | Multiplier | Product | N/A
Add | A + B -> C | Addend | Addend | Sum | N/A
Subtract | A - B -> C | Minuend | Subtrahend | Difference | N/A
AddSub | A + C - B -> D | Addend | Subtrahend | Addend | Result
Right Shift | A >> Shift -> C | Input | N/A | Result | N/A
Left Shift | A << Shift -> C | Input | N/A | Result | N/A
Divide | A mod B -> C, A div B -> D | Dividend | Divisor | Remainder | Quotient
Modulo | A mod B -> C | Dividend | Divisor | Remainder | N/A
Compare | A = B, A < B, A > B | Input1 | Input2 | N/A | N/A
Copy | A -> C | Input | N/A | Result | N/A

To obtain a correct result, the input vectors must meet the requirements presented in Table C-61. Note that:
• Input restrictions are not checked by the PKCP
• A_Len and B_Len indicate the size of vectors A and B in (32-bit) words
• Max_Len equals 64 (32-bit) words, i.e. the standard maximum vector size is 2048 bits

Note: Maximum vector sizes can be optionally extended to 4096 or 8192 bits (with Max_Len equal to 128 or 256, respectively).

Table C-61. Restrictions on Input Vectors for PKCP Operations
Operational Restrictions
Function | Requirements
Multiply | 0 < A_Len, B_Len < Max_Len
Add | 0 < A_Len, B_Len < Max_Len
Subtract | 0 < A_Len, B_Len < Max_Len; result must be positive (A > B)
AddSub | 0 < A_Len < Max_Len (B and C operands have A_Len as length, B_Len ignored); result must be positive ((A + C) > B)
Right Shift | 0 < A_Len < Max_Len
Left Shift | 0 < A_Len < Max_Len
Divide, Modulo | 1 < B_Len < A_Len < Max_Len; the most significant 32-bit word of B operand cannot be zero
Compare | 0 < A_Len < Max_Len (B operand has A_Len as length, B_Len ignored)
Copy | 0 < A_Len < Max_Len

The host is responsible for allocating a block of contiguous memory in PKA RAM for the result vector(s). Table C-62 indicates how much memory should be allocated for the result vector(s).

Table C-62. PKCP Result Vector Memory Allocation
Result Vector Memory Allocation
Function | Result Vector | Result Vector Length (in 32-bit words)
Multiply | C | A_Len + B_Len + 6 (the 6 ‘scratchpad’ words should be discarded)
Add | C | Max(A_Len, B_Len) + 1
Subtract | C | Max(A_Len, B_Len)
AddSub | D | A_Len + 1
Right Shift | C | A_Len
Left Shift | C | A_Len + 1 (when Shift Value is non-zero); A_Len (when Shift Value is zero)
Divide | C | Remainder: B_Len + 1 (one ‘scratchpad’ word should be discarded)
Divide | D | Quotient: A_Len - B_Len + 1
Modulo | C | Remainder: B_Len + 1
Compare | None | Compare updates the PKA_COMPARE register
Copy | C | A_Len

Input vectors for an operation are always allowed to overlap in memory (partially or completely). Table C-63 gives restrictions for the overlap of output and input vectors of the operations.

Table C-63. PKCP Result Vector / Input Vector Overlap Restrictions
Result Vector / Input Vector Overlap Restrictions
Function | Result Vector | Restrictions
Multiply | C | No overlap with A or B vectors allowed.
Add, Subtract | C | May overlap with A and/or B vector, provided the start address of the C vector does not lie above the start address of the vector(s) with which it overlaps.
AddSub | D | May overlap with A, B and/or C vector, provided the start address of the D vector does not lie above the start address of the vector(s) with which it overlaps.
Right Shift, Left Shift | C | May overlap with A vector, provided the start address of the C vector does not lie above the start address of the A vector.
Divide | C | No overlap with A, B or D vectors allowed.
Divide | D | No overlap with A, B or C vectors allowed.
Modulo | C | No overlap with A or B vectors allowed.
Compare | None | Compare does not write a result vector.
Copy | C | Same restrictions as for Right/Left Shift; copying a vector to a lower address is always allowed even if source and destination overlap. [a]

a. The Copy operation can be used to fill memory by breaking the overlap restrictions, but requires TWO initial (32-bit) words to be set up: To zero a block of memory, set the A vector pointer to the block start, set the C vector pointer two words higher and set the A vector length to the block length minus two (words). Fill the first two words of the block with constant zero and perform a PKCP Copy operation to zero the remainder of the block.

The Sequencer controls modular exponentiation operations. This document assumes that the Sequencer program ROM/RAM holds code that implements the following modular exponentiation (ExpMod) operations (using the LNME for most of the work if one is available).

Table C-64. Summary of ExpMod Operations
Function | Mathematical Operation | Vector A | Vector B | Vector C | Vector D
ExpMod-ACT2, ExpMod-ACT4, ExpMod-variable | C^A mod B -> D | Exponent, length = A_Len | Modulus, length = B_Len | Base, length = B_Len | Result & Workspace
ExpMod-CRT | See below | Exp P followed by Exp Q at next higher even word address [a], both A_Len long | Mod P + buffer word followed by Mod Q at next higher even word address [b], both B_Len long | Q inverse, length = B_Len | Input, Result (both 2 x B_Len long) & Workspace

a. If A_Len is even, Exp Q follows Exp P immediately; if A_Len is odd, there is one empty word between Exp Q and Exp P.
b. If B_Len is even, there are two empty words between Mod P and Mod Q; if B_Len is odd, there is one empty (buffer) word between Mod Q and Mod P. Note that the words following Mod P and Mod Q may be zeroed by Sequencer firmware.
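The placement rules in footnotes a and b for the ExpMod-CRT A and B vectors can be sketched as a word-offset calculation. This is an illustrative sketch, not vendor code: the function names are hypothetical, offsets are in 32-bit words relative to the start of each vector, and it assumes that the vector itself starts at an even word address so that "next higher even word address" can be evaluated on the relative offset.

```python
def next_even(word_offset):
    """Round a word offset up to the next even word offset."""
    return word_offset + (word_offset % 2)


def crt_vector_offsets(a_len, b_len):
    """Word offsets of the ExpMod-CRT sub-vectors (hypothetical helper).

    Assumes the A and B vectors each start at an even word address.
    """
    exp_q = next_even(a_len)        # Exp Q after Exp P, even-aligned
    mod_q = next_even(b_len + 1)    # Mod Q after Mod P plus its buffer word
    return {
        "exp_p": 0,
        "exp_q": exp_q,
        "exp_gap_words": exp_q - a_len,   # 0 if A_Len even, 1 if odd
        "mod_p": 0,
        "mod_q": mod_q,
        "mod_gap_words": mod_q - b_len,   # 2 if B_Len even, 1 if odd
    }


# Example: odd lengths leave exactly one empty word in each vector.
print(crt_vector_offsets(a_len=3, b_len=3))
```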
The ExpMod-CRT operation performs the following computation steps [1]:
• X <- (Input mod Mod P)^(Exp P) mod Mod P
• Y <- (Input mod Mod Q)^(Exp Q) mod Mod Q
• Z <- ((((X - Y) mod Mod P) * Q inverse) mod Mod P) * Mod Q
• Result <- Y + Z

1. These steps implement Garner’s recombination algorithm after the basic exponentiations.

The ExpMod-ACT2, -ACT4 and -variable functions implement the same mathematical operation but with a differently sized table of pre-calculated ‘odd powers’. The ExpMod-ACT2 function uses a table with two entries whereas ExpMod-ACT4 uses a table with eight entries. The ACT4 version gives better performance but needs more memory. ExpMod-variable and ExpMod-CRT allow a variable number (from 1 up to and including 16) of odd powers to be selected via the register normally used to specify the number of bits to shift for shift operations.

For a user of the PKA Engine, the exponentiation functions appear to be extensions of the set of PKCP functions as described in “PKCP Operations” on page 354. Input and result vectors are passed in the same way as for basic PKCP operations. Table C-65 shows the restrictions on the input and result vectors for the exponentiation operations.

Table C-65. Restrictions on Input Vectors for ExpMod Operations
Operational Restrictions
Function | Requirements
ExpMod-ACT2, ExpMod-ACT4, ExpMod-variable |
1) 0 < A_Len < Max_Len
2) 1 < B_Len < Max_Len
3) Modulus B must be odd (i.e. the least significant bit must be ONE)
4) Modulus B > 2^32
5) Base C < Modulus B
6) Vectors B and C must be followed by an empty 32-bit ‘buffer’ word
ExpMod-CRT |
1) 0 < A_Len < Max_Len
2) 1 < B_Len < Max_Len
3) Mod P and Mod Q must be odd (i.e. the least significant bits must be ONE)
4) Mod P > Mod Q > 2^32
5) Mod P and Mod Q must be co-prime (their GCD must be 1)
6) 0 < Exp P < (Mod P - 1)
7) 0 < Exp Q < (Mod Q - 1)
8) (Q inverse * Mod Q) = 1 (modulo Mod P)
9) Input < (Mod P * Mod Q)
10) Mod P and Mod Q must be followed by an empty 32-bit ‘buffer’ word

Table C-66 shows the required scratchpad sizes for the exponentiation operations; these depend upon the PKA Engine type. The M_Len used in the table is the ‘real’ Modulus length (for Mod P in an ExpMod-CRT operation, for Modulus B in the other operations) in 32-bit words, i.e. without trailing zero words at the end. If the last word of the modulus vector as given is non-zero, M_Len equals B_Len.

Table C-66. ExpMod Result Vector/Scratchpad Area Memory Allocation
Result Vector/Scratchpad Area Memory Allocation (both starting at PKA_DPTR)
Function | PKA Engine Type | Scratchpad Area Size (in 32-bit Words); Result Vector is either M_Len or 2 x M_Len 32-bit Words Long
ExpMod-ACT2 | With LNME | 3 x (M_Len + 2 - (M_Len MOD 2)) + 10
ExpMod-ACT2 | PKCP-only | 5 x (M_Len + 2)
ExpMod-ACT4 | With LNME | 9 x (M_Len + 2 - (M_Len MOD 2))
ExpMod-ACT4 | PKCP-only | 11 x (M_Len + 2)
ExpMod-variable | With LNME | Maximum of (3 x (M_Len + 2 - (M_Len MOD 2)) + 10) and ((# odd powers + 1) x (M_Len + 2 - (M_Len MOD 2)))
ExpMod-variable | PKCP-only | (# odd powers + 3) x (M_Len + 2)
ExpMod-CRT | With LNME | Maximum of (4 x (M_Len + 2 - (M_Len MOD 2)) + 10) and ((# odd powers + 2) x (M_Len + 2 - (M_Len MOD 2)))
ExpMod-CRT | PKCP-only | (# odd powers + 3) x (M_Len + 2) + (M_Len + 2 - (M_Len MOD 2))

Note: During execution of an ExpMod-ACT2, -ACT4 or -variable operation, the last 34 bytes of the PKA RAM are used as general scratchpad for the Sequencer’s program execution. The ExpMod-CRT operation requires the last 72 bytes of the PKA RAM as scratchpad. These (fixed location) areas cannot overlap with any of the input vectors and/or the D vector scratchpad area; they can be used freely when executing basic PKCP operations.

Table C-67. ExpMod Scratchpad Area / Input Vector Overlap Restrictions
Result Vector/Scratchpad Area Memory Allocation (both starting at PKA_DPTR)
Function | Result Vector | Restrictions
ExpMod-ACT2, ExpMod-ACT4, ExpMod-variable | D | Scratchpad area starting at D may not overlap with any of the other vectors, except that Base C may be co-located with result vector D to save space (i.e. PKA_CPTR = PKA_DPTR is allowed).
ExpMod-CRT | D | Scratchpad area starting at D may not overlap with any of the other vectors; this is also the location of the main Input vector (with length 2 x B_Len).

C.5.6 PKI Key Decrypt Key Management Interface

The PKI Key Decrypt Key management interface is kept very simple as seen from the PKA side: Four 16-word areas (all in on-chip secure RAM) are set aside to store Key Decrypt Keys, IV values, CTR mode increment values and the control bits for the AES core. Loading of the Key Decrypt Keys is left up to the Host. No functions are defined to actually manage the Key Decrypt Keys other than through direct access from the Host bus.

During High Assurance mode boot-up, the locations in secure RAM used for KDK storage and control are used to transfer PKA ‘farm’ engine firmware into the PKA. The whole KDK storage area and control words are then zeroed after the firmware has been copied to its intended internal locations.

Table C-68 shows the layout of the secure RAM, holding the parameter and control words for the PKI Black Key Decrypt functionality. These locations are described in further detail in the following sub-sections.
The secure RAM address space ranges from 0x10000 to 0x11FFF.

Table C-68. Secure RAM Layout for PKI Key Decrypt Key Storage
Byte Address (Within Secure RAM) | Control/Status Word Name | Description
0x0000 | PKI_KD_IV_0_0 | Initialization vector associated with PKI KDK number 0 (needed for non-ECB modes).
0x0004 | PKI_KD_IV_0_1
0x0008 | PKI_KD_IV_0_2
0x000C | PKI_KD_IV_0_3
0x0010 | PKI_KDK_0_0 | Actual PKI Key Decrypt Key number 0: least significant word always in PKI_KDK_0_0; 128-bit key most significant word in PKI_KDK_0_3; 192-bit key most significant word in PKI_KDK_0_5; 256-bit key most significant word in PKI_KDK_0_7.
0x0014 | PKI_KDK_0_1
0x0018 | PKI_KDK_0_2
0x001C | PKI_KDK_0_3
0x0020 | PKI_KDK_0_4
0x0024 | PKI_KDK_0_5
0x0028 | PKI_KDK_0_6
0x002C | PKI_KDK_0_7
0x0030 | PKI_KDK_CONTROL_0 | Control word for PKI KDK number 0, also validity check word.
0x0034 | PKI_KD_INCR_0 | CTR mode increment value for PKI KDK number 0.
0x0038 - 0x003F | Reserved, write zero | -
0x0040 | PKI_KD_IV_1_0 | Initialization vector associated with PKI KDK number 1 (needed for non-ECB modes).
0x0044 | PKI_KD_IV_1_1
0x0048 | PKI_KD_IV_1_2
0x004C | PKI_KD_IV_1_3
0x0050 | PKI_KDK_1_0 | Actual PKI Key Decrypt Key number 1: least significant word always in PKI_KDK_1_0; 128-bit key most significant word in PKI_KDK_1_3; 192-bit key most significant word in PKI_KDK_1_5; 256-bit key most significant word in PKI_KDK_1_7.
0x0054 | PKI_KDK_1_1
0x0058 | PKI_KDK_1_2
0x005C | PKI_KDK_1_3
0x0060 | PKI_KDK_1_4
0x0064 | PKI_KDK_1_5
0x0068 | PKI_KDK_1_6
0x006C | PKI_KDK_1_7
0x0070 | PKI_KDK_CONTROL_1 | Control word for PKI KDK number 1, also validity check word.
0x0074 | PKI_KD_INCR_1 | CTR mode increment value for PKI KDK number 1.
0x0078 - 0x007F | Reserved, write zero | -
0x0080 | PKI_KD_IV_2_0 | Initialization vector associated with PKI KDK number 2 (needed for non-ECB modes).
0x0084 | PKI_KD_IV_2_1
0x0088 | PKI_KD_IV_2_2
0x008C | PKI_KD_IV_2_3
0x0090 | PKI_KDK_2_0 | Actual PKI Key Decrypt Key number 2: least significant word always in PKI_KDK_2_0; 128-bit key most significant word in PKI_KDK_2_3; 192-bit key most significant word in PKI_KDK_2_5; 256-bit key most significant word in PKI_KDK_2_7.
0x0094 | PKI_KDK_2_1
0x0098 | PKI_KDK_2_2
0x009C | PKI_KDK_2_3
0x00A0 | PKI_KDK_2_4
0x00A4 | PKI_KDK_2_5
0x00A8 | PKI_KDK_2_6
0x00AC | PKI_KDK_2_7
0x00B0 | PKI_KDK_CONTROL_2 | Control word for PKI KDK number 2, also validity check word.
0x00B4 | PKI_KD_INCR_2 | CTR mode increment value for PKI KDK number 2.
0x00B8 - 0x00BF | Reserved, write zero | -
0x00C0 | PKI_KD_IV_3_0 | Initialization vector associated with PKI KDK number 3 (needed for non-ECB modes).
0x00C4 | PKI_KD_IV_3_1
0x00C8 | PKI_KD_IV_3_2
0x00CC | PKI_KD_IV_3_3
0x00D0 | PKI_KDK_3_0 | Actual PKI Key Decrypt Key number 3: least significant word always in PKI_KDK_3_0; 128-bit key most significant word in PKI_KDK_3_3; 192-bit key most significant word in PKI_KDK_3_5; 256-bit key most significant word in PKI_KDK_3_7.
0x00D4 | PKI_KDK_3_1
0x00D8 | PKI_KDK_3_2
0x00DC | PKI_KDK_3_3
0x00E0 | PKI_KDK_3_4
0x00E4 | PKI_KDK_3_5
0x00E8 | PKI_KDK_3_6
0x00EC | PKI_KDK_3_7
0x00F0 | PKI_KDK_CONTROL_3 | Control word for PKI KDK number 3, also validity check word.
0x00F4 | PKI_KD_INCR_3 | CTR mode increment value for PKI KDK number 3.
0x00F8 - 0x00FF | Reserved, write zero | -
0x0100 - 0x1FFF | Internal use, do not modify | Holds PKI command/result ring management, general PKA master controller scratchpad, command pre- and post-data areas.

The layout of the PKI_KD… words in secure RAM is chosen so that they can be transferred as one block into the control registers of the PKA’s local AES core using the local DMA engine.

C.5.6.1 AES Byte Order Example

The following example is based on NIST Special Publication 800-38A, Appendix F (for additional information about this publication, see the reference on “NIST Special Publication 800-38A” on page 579). As indicated in the example below, PKI_KDK, PKI_KD_IV and Input Data (the data to which the pointers ‘A’...‘E’ in Table C-53 refer) need to be byte swapped when copied into the local AES engine. After processing, the output data leaves the AES engine as the decrypted ‘Input Data’. The red color indicates the first byte of each item.

Listing C-6. AES-CTR 128-Bit Decrypt.
NIST SP 800-38a:
  AES Key In:   2b7e1516_28aed2a6_abf71588_09cf4f3c
  IV/CTR In:    f0f1f2f3_f4f5f6f7_f8f9fafb_fcfdfeff
  AES Data In:  874d6191_b620e326_1bef6864_990db6ce
  AES Data Out: 6bc1bee2_2e409f96_e93d7e11_7393172a

KDK storage locations in secure RAM (input):
  PKI_KDK_x_0[31:0]:   0x16157e2b
  PKI_KDK_x_1[31:0]:   0xa6d2ae28
  PKI_KDK_x_2[31:0]:   0x8815f7ab
  PKI_KDK_x_3[31:0]:   0x3c4fcf09
  PKI_KD_IV_x_0[31:0]: 0xf3f2f1f0
  PKI_KD_IV_x_1[31:0]: 0xf7f6f5f4
  PKI_KD_IV_x_2[31:0]: 0xfbfaf9f8
  PKI_KD_IV_x_3[31:0]: 0xfffefdfc
  PKI_KDK_CONTROL_x:   0xffdf0020 (AES-CTR 128-bit Decrypt)
  PKI_KD_INCR_x[31:0]: 0x00000001

Encrypted Input Data (Local AES core input):
  ENC_DATA_0[31:0]: 0x91614d87
  ENC_DATA_1[31:0]: 0x26e320b6
  ENC_DATA_2[31:0]: 0x6468ef1b
  ENC_DATA_3[31:0]: 0xceb60d99

Decrypted Data (Local AES core output):
  AES_DATA_IO_0[31:0]: 0xe2bec16b
  AES_DATA_IO_1[31:0]: 0x969f402e
  AES_DATA_IO_2[31:0]: 0x117e3de9
  AES_DATA_IO_3[31:0]: 0x2a179373

C.5.6.2 PKI Key Decrypt Keys Storage (PKI_KDK_0_[0:7] … _3_[0:7])

The PKI Key Decrypt Keys storage uses four times 8 words at the start of the secure RAM. Keys are stored with their least significant word first. A 128-bit key only uses the first 4 words of each storage area, while a 192-bit key uses the first 6 words of each storage area. For byte order information, please refer to AES Byte Order Example.

Note: The PKI Key Decrypt Keys do not need to be pre-processed AES ‘decrypt’ keys; conversion of normal AES keys stored here to AES decrypt keys is done automatically within the local AES core.

PKI_KDK_0_[0:7]
PKI_KDK_0_[0:7] (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10010-0x1002F
Register diagram: bits [31:0] hold KDK_0; the reset value is shown as x for all bits.

Table C-69. PKI_KDK_0_[0:7] Bit Descriptions
Bits | Name | Type | Function
[31:0] | KDK_0 | R/W | Eight consecutive words holding PKI Key Decrypt Key number 0 (located in secure RAM).

PKI_KDK_1_[0:7]
PKI_KDK_1_[0:7] (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10050-0x1006F
Register diagram: bits [31:0] hold KDK_1; the reset value is shown as x for all bits.

Table C-70. PKI_KDK_1_[0:7] Bit Descriptions
Bits | Name | Type | Function
[31:0] | KDK_1 | R/W | Eight consecutive words holding PKI Key Decrypt Key number 1 (located in secure RAM).

PKI_KDK_2_[0:7]
PKI_KDK_2_[0:7] (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10090-0x100AF
Register diagram: bits [31:0] hold KDK_2; the reset value is shown as x for all bits.

Table C-71. PKI_KDK_2_[0:7] Bit Descriptions
Bits | Name | Type | Function
[31:0] | KDK_2 | R/W | Eight consecutive words holding PKI Key Decrypt Key number 2 (located in secure RAM).

PKI_KDK_3_[0:7]
PKI_KDK_3_[0:7] (Restricted Read/Write), 18-bit Address in Host Target Window: 0x100D0-0x100EF
Register diagram: bits [31:0] hold KDK_3; the reset value is shown as x for all bits.

Table C-72. PKI_KDK_3_[0:7] Bit Descriptions
Bits | Name | Type | Function
[31:0] | KDK_3 | R/W | Eight consecutive words holding PKI Key Decrypt Key number 3 (located in secure RAM).
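The byte order used in Listing C-6 can be reproduced with a short host-side sketch. The helper name to_secure_ram_words is hypothetical; the code only illustrates the relationship visible in the listing, where each group of four bytes from the NIST byte string becomes one 32-bit word with its bytes swapped (that is, the byte string is read as little-endian words, first word first).

```python
def to_secure_ram_words(hex_string):
    """Convert a NIST-style hex string into byte-swapped 32-bit words.

    Hypothetical helper: underscores are ignored, and every 4 input
    bytes are interpreted as one little-endian 32-bit word.
    """
    data = bytes.fromhex(hex_string.replace("_", ""))
    if len(data) % 4 != 0:
        raise ValueError("input must be a whole number of 32-bit words")
    return [int.from_bytes(data[i:i + 4], "little")
            for i in range(0, len(data), 4)]


# AES key from the listing; word 0 goes to PKI_KDK_x_0, and so on.
key_words = to_secure_ram_words("2b7e1516_28aed2a6_abf71588_09cf4f3c")
print([hex(word) for word in key_words])
```

The same conversion applies to the IV and to the encrypted input data shown in the listing.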
C.5.6.3 PKI Key Decrypt IVs Storage (PKI_KD_IV_0_[0:3] … _3_[0:3])

The PKI Key Decrypt Initialization Vector (IV) storage uses four times 4 words at the start of the secure RAM. IVs are stored with their least significant word first. For byte order information, please refer to AES Byte Order Example.

PKI_KD_IV_0_[0:3]
PKI_KD_IV_0_[0:3] (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10000-0x1000F
Register diagram: bits [31:0] hold KD_IV_0; the reset value is shown as x for all bits.

Table C-73. PKI_KD_IV_0_[0:3] Bit Descriptions
Bits | Name | Type | Function
[31:0] | KD_IV_0 | R/W | Four consecutive words holding the Initialization Vector associated with PKI Key Decrypt Key number 0 (located in secure RAM).

PKI_KD_IV_1_[0:3]
PKI_KD_IV_1_[0:3] (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10040-0x1004F
Register diagram: bits [31:0] hold KD_IV_1; the reset value is shown as x for all bits.

Table C-74. PKI_KD_IV_1_[0:3] Bit Descriptions
Bits | Name | Type | Function
[31:0] | KD_IV_1 | R/W | Four consecutive words holding the Initialization Vector associated with PKI Key Decrypt Key number 1 (located in secure RAM).

PKI_KD_IV_2_[0:3]
PKI_KD_IV_2_[0:3] (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10080-0x1008F
Register diagram: bits [31:0] hold KD_IV_2; the reset value is shown as x for all bits.

Table C-75. PKI_KD_IV_2_[0:3] Bit Descriptions
Bits | Name | Type | Function
[31:0] | KD_IV_2 | R/W | Four consecutive words holding the Initialization Vector associated with PKI Key Decrypt Key number 2 (located in secure RAM).

PKI_KD_IV_3_[0:3]
PKI_KD_IV_3_[0:3] (Restricted Read/Write), 18-bit Address in Host Target Window: 0x100C0-0x100CF
Register diagram: bits [31:0] hold KD_IV_3; the reset value is shown as x for all bits.

Table C-76. PKI_KD_IV_3_[0:3] Bit Descriptions
Bits | Name | Type | Function
[31:0] | KD_IV_3 | R/W | Four consecutive words holding the Initialization Vector associated with PKI Key Decrypt Key number 3 (located in secure RAM).

C.5.6.4 PKI Key Decrypt CTR Mode Increment Storage (PKI_KD_INCR_0 … _3)

The PKI Key Decrypt Keys must be accompanied by a 32-bit increment value when CTR mode decrypt is being used. These increment values are stored in secure RAM and are copied to the local AES core’s AES_INC register when needed. Note that this is an internal register that is not accessible by the Host. For byte order information, please refer to “AES Byte Order Example” on page 361.

PKI_KD_INCR_0
PKI_KD_INCR_0 (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10034-0x10037
Register diagram: bits [31:0] hold KD_INCR_0; the reset value is shown as x for all bits.

Table C-77. PKI_KD_INCR_0 Bit Descriptions
Bits | Name | Type | Function
[31:0] | KD_INCR_0 | R/W | One word holding the CTR mode increment value associated with PKI Key Decrypt Key number 0 (located in secure RAM).
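The Host Target Window addresses given for the individual KDK, IV, control and increment words follow a regular pattern: each of the four key slots occupies 0x40 bytes of secure RAM starting at 0x10000. The sketch below derives the addresses from that pattern. The names kdk_field_address, SLOT_STRIDE and FIELD_LAYOUT are hypothetical and introduced here for illustration only; the offsets are taken from Table C-68.

```python
SECURE_RAM_BASE = 0x10000   # start of secure RAM in the Host Target Window
SLOT_STRIDE = 0x40          # 16 words per key slot

# field name -> (byte offset within the slot, number of 32-bit words)
FIELD_LAYOUT = {
    "IV": (0x00, 4),
    "KDK": (0x10, 8),
    "CONTROL": (0x30, 1),
    "INCR": (0x34, 1),
}


def kdk_field_address(slot, field, word=0):
    """Return the host address of one word of a KDK slot field."""
    if not 0 <= slot <= 3:
        raise ValueError("slot must be 0..3")
    offset, words = FIELD_LAYOUT[field]
    if not 0 <= word < words:
        raise ValueError("word index out of range for " + field)
    return SECURE_RAM_BASE + slot * SLOT_STRIDE + offset + 4 * word


# Example: first key word of slot 1 and the increment word of slot 3.
print(hex(kdk_field_address(1, "KDK")), hex(kdk_field_address(3, "INCR")))
```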
PKI_KD_INCR_1
PKI_KD_INCR_1 (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10074-0x10077
Register diagram: bits [31:0] hold KD_INCR_1; the reset value is shown as x for all bits.

Table C-78. PKI_KD_INCR_1 Bit Descriptions
Bits | Name | Type | Function
[31:0] | KD_INCR_1 | R/W | One word holding the CTR mode increment value associated with PKI Key Decrypt Key number 1 (located in secure RAM).

PKI_KD_INCR_2
PKI_KD_INCR_2 (Restricted Read/Write), 18-bit Address in Host Target Window: 0x100B4-0x100B7
Register diagram: bits [31:0] hold KD_INCR_2; the reset value is shown as x for all bits.

Table C-79. PKI_KD_INCR_2 Bit Descriptions
Bits | Name | Type | Function
[31:0] | KD_INCR_2 | R/W | One word holding the CTR mode increment value associated with PKI Key Decrypt Key number 2 (located in secure RAM).

PKI_KD_INCR_3
PKI_KD_INCR_3 (Restricted Read/Write), 18-bit Address in Host Target Window: 0x100F4-0x100F7
Register diagram: bits [31:0] hold KD_INCR_3; the reset value is shown as x for all bits.

Table C-80. PKI_KD_INCR_3 Bit Descriptions
Bits | Name | Type | Function
[31:0] | KD_INCR_3 | R/W | One word holding the CTR mode increment value associated with PKI Key Decrypt Key number 3 (located in secure RAM).

C.5.6.5 PKI Key Decrypt Key Control Words

Each of the four PKI Key Decrypt Keys (KDK) has a separate control word whose bits [15:0] are transferred to the local AES core’s AES_MODE register. Note that this is an internal register that is not accessible by the Host. A KDK is assumed to be valid when bits [15:10] are zero and bits [31:16] are the bit-by-bit complement of bits [15:0].

PKI_KDK_CONTROL_x
PKI_KDK_CONTROL_0 (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10030-0x10033
PKI_KDK_CONTROL_1 (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10070-0x10073
PKI_KDK_CONTROL_2 (Restricted Read/Write), 18-bit Address in Host Target Window: 0x100B0-0x100B3
PKI_KDK_CONTROL_3 (Restricted Read/Write), 18-bit Address in Host Target Window: 0x100F0-0x100F3
Register diagram (fields from bit 31 down to bit 0): Must be ones, Bit-by-bit complement of bits [9:0], Must be zeroes, cfb_width, cfb, ofb, ctr, cbc, ecb, key_size, encrypt; the reset value is shown as x for all bits.

Table C-81. PKI_KDK_CONTROL_x Bit Descriptions
Bits | Name | Type | Function
[31:26] | Must be ones | R/W | This field must be all ones to have this KDK considered as valid.
[25:16] | Bit-by-bit complement of bits [9:0] | R/W | This field should contain the bit-by-bit complement of the cfb_width … encrypt bits [9:0] to have this KDK considered as valid.
[15:10] | Must be zeroes | R/W | This field must be all zeroes to have this KDK considered as valid.
[9:8] | cfb_width | R/W | Sets the number of bits fed back for the CFB operation. ‘00’ feeds back 128 bits, ‘01’ feeds back 1 (ONE) bit, ‘10’ feeds back 8 bits, ‘11’ is reserved, do not use.
[7] | cfb | R/W | Indicates Cipher Feed-Back mode operations are to be performed, mutually exclusive with the other mode selection bits (exactly one of these five must be set to ‘1’).
[6] | ofb | R/W | Indicates Output Feed-Back mode operations are to be performed, mutually exclusive with the other mode selection bits (exactly one of these five must be set to ‘1’).
[5] | ctr | R/W | Indicates Counter mode operations are to be performed, mutually exclusive with the other mode selection bits (exactly one of these five must be set to ‘1’).
[4] | cbc | R/W | Indicates Cipher Block Chaining mode operations are to be performed, mutually exclusive with the other mode selection bits (exactly one of these five must be set to ‘1’).
[3] | ecb | R/W | Indicates Electronic Code Book mode operations are to be performed, mutually exclusive [a] with the other mode selection bits (exactly one of these five must be set to ‘1’).
[2:1] | key_size | R/W | These two write only bits specify the key length to use. ‘00’ selects 128-bit keys, ‘01’ selects 192-bit keys, ‘10’ selects 256-bit keys, ‘11’ is reserved, do not use.
[0] | encrypt | R/W | Specifies encrypt (‘1’) or decrypt (‘0’) operation to be performed. Will normally be ‘0’.

a. The five mode control bits are actually priority encoded, but using this feature is NOT advised.

C.5.7 PKI Engine Boot-Up and Internal Error Reporting

The PKA internal firmware uses the PKA_MASTER_SEQ_CTRL register accessible through the Host interface (in non-High Assurance mode) for internal error reporting and ‘side-channel’ control.

Table C-82. PKA_MASTER_SEQ_CTRL Register Bit Descriptions
Bits | Name | Type | Function
[31] | Reset | R/W | Active HIGH software reset for the PKA master controller, enables access to PKA master controller program RAM when ‘1’.
[30:16] | RESERVED | -- | Reserved: write zeroes and ignore on read.
[15:8] | Status | R | This field conveys status information. Bit [8] is used to generate the PKA master interrupt (always set on an error), bit [15] indicates an actual error situation:
  0x00: No error
  0x01: No error, used to trigger the Host software during bootup
  0x83: List full error
  0x85: Process sequence state error
  0x87: Invalid address error
  0x89: DMA error
  0x8B: Invalid use/setting
  0x8D: Invalid or no command
  0x8F: Invalid farm number
  0xFD: Function not available
  0xFF: Severe error (suspected deadlock)
  When an error is reported (bit [15] is HIGH), the buffer RAM word at Host window offset address 0x00074 (that is, the word following the control word PKA_RING_OPTIONS) holds a pointer into the firmware code that indicates the instruction where the error was detected; the firmware itself is halted.
[7] | SW_reset | Set-only | Set this bit HIGH to abort all operations in the PKA gracefully (that is, without breaking off ongoing Host transfers). Automatically reset LOW after handling.
[6:2] | RESERVED | -- | Reserved: write zeroes and ignore on read.
[1] | Reset_DMA | Set-only | Set this bit HIGH to reset the internal DMA channel accessing the Host interface. Use this after a DMA error has been reported. Automatically reset LOW after handling.
[0] | RESERVED | -- | Reserved: write zero and ignore on read.

The boot-up sequence of the PKA requires three firmware images:
• Master boot image
• Farm engine execution image
• Master execution image

The sequence to boot up the PKA is as follows:
1. Load the master boot image into PKA_MASTER_PROG_RAM.
2. Load the farm engine execution image into PKA_BUFFER_RAM (non-High Assurance mode) or into PKA_SECURE_RAM (High Assurance mode).
3. Take the PKA master controller out of reset (clear bit [31] of the PKA_MASTER_SEQ_CTRL register); this starts distribution of the farm engine execution image to the farm engines and performs other preparatory steps.
4. Wait until the pka_master_irq becomes active (poll the AIC_RAW_STAT register or poll bit [8] of the PKA_MASTER_SEQ_CTRL register).
5. Verify that the PKA master controller set bits [15:8] of the PKA_MASTER_SEQ_CTRL register to value 0x01. If that is not the case, then the boot image program encountered an error.
6. Push the PKA master controller into reset (set bit [31] of the PKA_MASTER_SEQ_CTRL register).
7. Load the master execution image into PKA_MASTER_PROG_RAM.
8. Take the PKA master controller out of reset (clear bit [31] of the PKA_MASTER_SEQ_CTRL register); this starts the actual execution image.
9. Write the ring configuration and control words at the start of the PKA_BUFFER_RAM and then write the PKA_RING_OPTIONS register. The PKA is now ready to receive the first command.

Note: The pka_mst_clk must be running during the whole boot-up sequence. After this sequence, the PKA is operational and able to handle commands. Note that when the first command descriptor is set up and executed, the PKA will automatically enter the execution stage.

C.6 Conventions Used in this Manual

C.6.1 Register Information

Registers within this document are shown as follows:

REGISTER_HEAD (Write Only), 18-bit Address in Host Target Window: 0x00000
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The register name, accessibility and the Host address location for direct access are on the top lines. The table shows all the register bit fields; a supporting description is included in the text below the register table. Reserved fields are shaded gray. The bottom row in the register graphic shows the power-up / reset default setting of the register for read.
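As a worked instance of reading a register bit-field description, the PKI_KDK_CONTROL_x layout and validity rule from Section C.5.6.5 can be packed and checked as sketched below. This is a minimal sketch under stated assumptions: the function names and dictionaries are hypothetical, the field positions follow the Table C-81 bit descriptions, and the validity test follows the rule that bits [15:10] are zero and bits [31:16] are the complement of bits [15:0].

```python
MODE_BITS = {"cfb": 7, "ofb": 6, "ctr": 5, "cbc": 4, "ecb": 3}
KEY_SIZE_CODES = {128: 0b00, 192: 0b01, 256: 0b10}
CFB_WIDTH_CODES = {128: 0b00, 1: 0b01, 8: 0b10}


def make_kdk_control(mode, key_bits, encrypt=False, cfb_width=128):
    """Build a control word: mode bits in [15:0], complement in [31:16]."""
    low = (CFB_WIDTH_CODES[cfb_width] << 8)   # cfb_width, bits [9:8]
    low |= 1 << MODE_BITS[mode]               # exactly one mode bit set
    low |= KEY_SIZE_CODES[key_bits] << 1      # key_size, bits [2:1]
    low |= int(encrypt)                       # encrypt, bit [0]
    return ((~low & 0xFFFF) << 16) | low


def is_valid_kdk_control(word):
    """Check the validity rule for a 32-bit control word."""
    low = word & 0xFFFF
    high = (word >> 16) & 0xFFFF
    return (low >> 10) == 0 and high == (~low & 0xFFFF)


# AES-CTR, 128-bit key, decrypt: matches the value used in Listing C-6.
print(hex(make_kdk_control("ctr", 128)))
```

Deriving the upper half from the lower half, rather than storing both, keeps the two halves from drifting apart when a field is changed.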
APPENDIX D: INLINE PACKET ENGINE

D.1 Crypto Packet Processor Processing Overview
This appendix provides an overview of the Crypto Packet Processor and its function in the following sections:
• D.1 “Crypto Packet Processor Processing Overview” on page 371
• D.2 “Configuring the Crypto Packet Processor” on page 372
• D.3 “Pseudo Random Number Generator” on page 375
• D.4 “Input Token Definition” on page 379
• D.5 “Processing Instructions” on page 392
• D.6 “Result Token Definition” on page 429
• D.7 “Pre and Post-Processing by Host Software” on page 432
• D.8 “Context Record Definition” on page 433
• D.9 “Register and Memory Map” on page 456
• D.10 “Protocol Compliancy” on page 481

D.1.1 Crypto Packet Processor Terms
This appendix makes frequent use of the terms token and context (see www.iana.org/assignments/protocol-numbers). These terms refer to data structures that the Crypto Packet Processor uses to perform packet processing operations.
In IPSec, a context is a (packet independent) data structure that contains key material and processing parameters associated with the processing of packets that have been recognized (classified) to be sent through a specific IPSec tunnel. In IPSec terminology, the Security Association (SA) structure that defines an IPSec tunnel is represented as the context in the Crypto Packet Processor. Because the Crypto Packet Processor can support more protocols than just IPSec, the Crypto Packet Processor context contains more parameters than just the IPSec SA parameters.
The term context is often said to describe a packet transform. The term transform stems from the definition: if a packet is processed according to the rules specified in the context, the packet is said to have been transformed by the Crypto Packet Processor.
Alternate terms for the transformation of packets in this way are tunneling or encapsulation, which specifically refer to packet transformations requiring the encryption of plaintext data, and detunneling or decapsulation, which refer to the reverse operation. A context contains data that is not packet specific; multiple packets can be processed (transformed) under the same context.
The term token refers to a data structure that is created for each (individual) packet. A token is a specific data structure containing processing commands and instructions that the Crypto Packet Processor uses to process one specific packet. Among other things, a token contains a number of parameters extracted from the packet itself (for example, packet or packet header length, specific packet header fields, or offsets to such fields in the packet stream), that is, data that can change with each packet. A Crypto Packet Processor token also contains processing instructions used to control the detailed packet processing operation performed by the Crypto Packet Processor.

D.1.1.1 Tokens
The input token is provided as part of the operation’s Extra Data (see “Operand Data Specification” on page 181). The token provided to the Crypto Packet Processor must have at least five words: four words that are always the first four words of an input token and one instruction word. The first four words contain pointers, options, and packet length. The next token words can contain instructions or data. The length of the input token is limited by the maximum size allowed in Extra Data. The order and sequence of instructions and data in the input token is restricted and is explained in “Input Token Definition” on page 379.
The result token can contain four to eight data words and contains the result-packet length, error flags, and packet result information. See “Result Token Definition” on page 429.

D.1.1.2 Context
The context data is provided as part of the operation’s Extra Data (see “Operand Data Specification” on page 181). The context data contains processing parameters and keying data. The first two words of each context contain the options and information about the available fields in the context. The sequence of the different fields within a context is fixed, but not all fields need to be available. The length of a field is variable and depends on the selected options and algorithms in the first two context words. For optimal memory usage, all available fields are concatenated to each other.

D.2 Configuring the Crypto Packet Processor
Before the Crypto Packet Processor can be used for packet processing, the Host must initialize its configuration registers. The configuration registers are accessible in the Engine Global Address Space. This section discusses general principles and methods for initializing and using the Crypto Packet Processor efficiently. Only general concepts are presented here; details on the registers used for configuration can be found in D.9 “Register and Memory Map” on page 456.

D.2.1 Enabling Protocol and Algorithm Support
First, you must decide whether protocol and algorithm support is to be enabled. By default, all implemented cryptographic protocols and algorithms in the Crypto Packet Processor are enabled. Individual algorithms as well as protocol support can be disabled by using the Protocol/Algorithm Enable register; see “Protocol/Algorithm Enable Register” on page 466. Enabling protocol support allows software on the Host system to determine the hardware capabilities. If an algorithm is disabled and the software tries to use that algorithm, an error is generated.
D.2.2 Context Fetch Modes
The second decision relates to context fetching. The Crypto Packet Processor supports two modes for context fetching:
1. Address mode. In Address mode, the context fields are fetched from the Context Record Address Map shown in “Context Record Format” on page 433, always starting from Context Control Word 0. Irrelevant fields can be filled with dummy values. The number of context words fetched is indicated by the Context Size (in dwords) located in the Context Control Register (see “Context Control Register” on page 466). This fetch mode can be used when the size of the context is fixed, that is, independent of the algorithm and mode used.
2. Control mode. In Control mode, the context fields are fetched from a customized Context Record, based on the control bits in the context control words. Control mode optimizes the context fetch by supporting a customized context record layout containing only the relevant fields (abutted to each other) for the requested operation. When bit C in the Input Token Header is set (see “C: Context Control Words Present in Token” on page 381), the fetch does not include Context Control Words 0/1, since these were fetched with the token. The fetch begins after Context Control Word 1 (at starting address 0x02) with a length defined by the context length field in Context Control Word 0. When bit C in the Input Token Header is not set, the fetch begins with Context Control Word 0 and the number of fetched context words is indicated by the Context Size (in dwords) located in the Context Control Register, as in Address mode. This fetch mode allows an optimal context record size to be used and reduces the overhead caused by fetching unused fields.
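The way the fetch mode is arbitrated between the Context Control Register, bits [9:8], and Context Control Word 1, bit [31] (Table D-1), can be sketched as a small lookup. This is an illustrative model only; the function name and the string return values are assumptions made for the example, not part of any Tilera API.

```python
def context_fetch_mode(context_control_register, context_control_word_1):
    """Hypothetical helper: return 'address' or 'control' following Table D-1.

    context_control_register: 32-bit value of the Context Control Register
    context_control_word_1:   32-bit value of Context Control Word 1
    """
    reg_bits = (context_control_register >> 8) & 0x3   # bits [9:8]
    ccw1_bit31 = (context_control_word_1 >> 31) & 0x1  # bit [31]

    if reg_bits in (0b00, 0b01):
        return "address"          # 00 (default) and 01: bit [31] is don't care
    if reg_bits == 0b10:
        return "control"          # 10: bit [31] is don't care
    # 11: Context Control Word 1, bit [31] makes the final choice
    return "address" if ccw1_bit31 else "control"
```

For example, with bits [9:8] set to 11 the Host can switch individual contexts back to Address mode by setting bit [31] in their Context Control Word 1, while all other contexts use Control mode.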
Table D-1 outlines how the selection of the Context Fetch Mode is arbitrated between the settings of the Context Control Register, bits [9:8], and Context Control Word 1, bit [31].

Table D-1. Context Fetch Control
Context Control Register, Bits [9:8] | Context Control Word 1, Bit [31] | Context Fetch Mode
00 | x | address mode (default)
01 | x | address mode
10 | x | control mode
11 | 0 | control mode
11 | 1 | address mode

See also “Context Control Register” on page 466 and “Context Record Definition” on page 433.

D.2.3 Packet Processing Modes
The packet-processing module contains all crypto and hashing blocks, including a programmable interconnect mechanism for these blocks, which allows for various processing modes. The packet-processing mode is controlled by the Type of Packet field in context control word 0 (see “Control Word 0 Field Encoding” on page 436). All possible configurations of the datapath are shown in the figures below.

[Figure D-1: Packet Processor with ToP=4’b000x. Datapath diagram: input, direction crossbar, and output; the hash and crypto paths are blocked.]
[Figure D-2: Packet Processor with ToP=4’b001x (left) and with ToP=4’b010x (right)]
[Figure D-3: Packet Processor with ToP=4’b011x (left) and with ToP=4’b111x (right)]

Note: Because the system is so complex, these figures represent the functional behavior of the system and not the actual physical implementation.
As shown in the figures above, the input data can be directed to three possible destinations within the packet processor (crypto, hash, and/or output), defined by the Type of Destination field (ToD) of the token instruction. Output of the packet processor is always passed to the post-processor module.
The table below describes the relation between the Type of Destination field of the instruction and the Type of Packet field (ToP) of context control word 0. In summary, the ToD field determines where the input data should be made available in the packet processor (the crypto block and/or the hash block and/or passed to the output). The ToP field in the context configures the datapath within the packet processor itself, enabling crypto and hash blocks when needed and determining the order of these operations.

Table D-2. Relation between ‘Type of Destination’ and ‘Type of Packet’ Fields
ToD Crypto | ToD Hash | ToD Output | ‘Type of Packet’ Field | Operation
0 | 0 | 1 | xxxx | Pass data to the output.
0 | 1 | 0 | xx1x | Pass data to the hash engine.
0 | 1 | 0 | xx0x | Remove data.
0 | 1 | 1 | xx1x | Pass data to the hash engine and also to the output.
0 | 1 | 1 | xx0x | Pass data to the output.
1 | 0 | 0 | x1xx | Pass data to the crypto engine; encrypted/decrypted data are ignored.
1 | 0 | 0 | x0xx | Remove data.
1 | 0 | 1 | x1xx | Pass data to the crypto engine and, after encryption/decryption, pass to the output.
1 | 0 | 1 | x0xx | Pass data to the output.
1 | 1 | 0 | 000x | Remove data.
1 | 1 | 0 | 001x | Pass data only to the hash engine.
1 | 1 | 0 | x10x | Pass data to the crypto engine; encrypted/decrypted data are ignored.
1 | 1 | 0 | 011x | Pass data to the crypto engine; encrypted/decrypted data are passed to the hash engine.
1 | 1 | 0 | 111x | Pass data to the crypto engine and pass the same data to the hash engine; encrypted/decrypted data are ignored.
1 | 1 | 1 | 000x | Pass data to the output.
1 | 1 | 1 | 001x | Pass data to the output and at the same time to the hash engine.
1 | 1 | 1 | x10x | Pass data to the crypto engine and, after encryption/decryption, pass to the output.
1 | 1 | 1 | 011x | Pass data to the crypto engine; encrypted/decrypted data are passed to the hash engine and to the output.
1 | 1 | 1 | 111x | Pass data to the crypto engine and pass the same data to the hash engine; encrypted/decrypted data are passed to the output.
Note: x = Don’t care.

D.3 Pseudo Random Number Generator
This section describes the Pseudo Random Number Generator (PRNG), its purpose, architecture, and function.

D.3.1 Purpose
The Crypto Packet Processor includes an ANSI X9.17 compliant Pseudo Random Number Generator (PRNG) that provides pseudo-random data for generating keys, Initialization Vectors (IVs), and so on. It provides up to 16 bytes of data at a time, enabling the generation of random IVs for Data Encryption Standard (DES), Triple-DES, and Advanced Encryption Standard (AES) on a per-packet basis without slowing down the system. The DES block within the PRNG uses a 64-bit LFSR as input data.
Unlike true random number generators, which exploit the randomness that occurs in some physical phenomena, pseudo random number generators are devices or algorithms that output statistically independent and unbiased numbers. In general, a PRNG is a deterministic algorithm that, for a given truly random binary sequence (provided as a combination of the seed and key registers), outputs a binary sequence that “appears” to be random.

D.3.2 Architecture
The PRNG module architecture diagram is shown in Figure D-4.
[Figure D-4: Pseudo Random Number Generator Architecture Diagram. Blocks shown: PRNG Processor Interface (seed_l, seed_h, key0_l, key0_h, key1_l, key1_h, res0_l, res0_h, res1_l, res1_h, lfsr_l, lfsr_h, ctrl, stat), PRNG Control, PRNG Counter, PRNG Datapath (XOR logic), and Triple-DES (DES Control, DES Datapath).]

The Control logic contains a state machine that controls the pseudo random number generation process. The PRNG module is configured for an operation through a set of registers in Engine Global Register Space. To determine the status of the PRNG module, read the PRNG_STAT register.

D.3.3 Functional Description
The operation of the Crypto Packet Processor internal PRNG is based on the “ANSI X9.17, Annex C example pseudo-random key and IV generation algorithm”. The PRNG can generate 64-bit or 128-bit pseudo-random numbers. A 128-bit result is generated by running the operation for 64-bit numbers twice, returning the results in the PRNG_RES0 and PRNG_RES1 registers. When a 64-bit result is requested, only PRNG_RES0 is used.
A 64-bit result is generated as follows:

$$I = \mathrm{ede}_{*K}(DT)$$
$$R = \mathrm{ede}_{*K}(I \oplus V)$$

A new V is then generated by:

$$V = \mathrm{ede}_{*K}(R \oplus I)$$

Here, ede means encryption-decryption-encryption as one form of a Triple-DES operation (see Figure D-5). A ciphertext C is calculated from a plaintext P using the following formula:

$$C = E_{K0}[D_{K1}[E_{K0}[P]]]$$

The key pair K0 and K1 is reserved only for the generation of keys, so they should not be the same as any previously known values.

[Figure D-5: Multiple Encryption with Triple DES Using Two Keys. The plaintext P passes through E (K0), D (K1), and E (K0) to produce the ciphertext C.]

The algorithm consists of three Triple-DES operations that use the same key pair *K. A schematic overview of the algorithm is provided below in Figure D-6.
[Figure D-6: Schematic Overview of the Pseudo-Random Algorithm Specified by ANSI X9.17. Elements shown: Key0/1 (K0, K1), LFSR (DT), TripleDES 1 (output I), XOR, TripleDES 2 (output R, the Result), XOR, TripleDES 3 (output V, the Seed).]

Input to the first Triple-DES operation is a secret value DT (DT stands for date/time). Since DT must be updated on each pseudo-random number generation, it is the output of an LFSR (see “Generation of DT” on page 377). Output of the first Triple-DES operation is the intermediate value I, which is stored for later use. Input to the second Triple-DES operation is the exclusive-or of I with a secret seed value V, which can be an arbitrary number. Output of the second Triple-DES operation is the vector R, which is the pseudo-random number returned as the result. Input to the third Triple-DES operation is the exclusive-or of the intermediate value I with R. Output of the third Triple-DES operation is an updated seed value V that is stored and used as input to the second Triple-DES operation of the next generation. All numbers are 64 bits wide, except for the key pair, which consists of two 56-bit keys.

D.3.4 Generation of DT
The plaintext input DT is generated from the output of a 64-bit LFSR. The LFSR is based on the primitive polynomial $f(x) = x^{64} + x^{63} + x^{61} + x^{60} + 1$ (see Figure D-7). After reset, the Host must seed the LFSR through the APB slave interface using two write transfers. The seed for the LFSR can be any value except all zeroes.

[Figure D-7: Diagram of 64-Bit LFSR to Generate Parameter DT. A chain of clocked registers producing lfsr[63:0], with XOR gates at the tap positions.]

The parameter DT could just as well be the output of a counter or any other circuitry that produces a unique number. For the Crypto Packet Processor, an LFSR is used because of the low gate count and improved timing.
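The DT generation and the three-step X9.17 round described above can be sketched in software as follows. This is an illustrative model under stated assumptions, not a description of the hardware: the LFSR is written in one common Fibonacci (external XOR) convention with left shift, whereas the actual register ordering and feedback structure in Figure D-7 may differ, and `toy_ede` is a made-up placeholder permutation standing in for the two-key Triple-DES operation. All function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def lfsr64_step(state):
    """Advance a 64-bit LFSR for f(x) = x^64 + x^63 + x^61 + x^60 + 1 by one bit.

    Assumed convention: Fibonacci form, shifting left, taps at bit
    positions 63, 62, 60 and 59 (zero-based). The seed must be non-zero.
    """
    if not 0 < state <= MASK64:
        raise ValueError("LFSR state must be a non-zero 64-bit value")
    feedback = ((state >> 63) ^ (state >> 62) ^ (state >> 60) ^ (state >> 59)) & 1
    return ((state << 1) | feedback) & MASK64


def x917_round(ede, dt, v):
    """One X9.17 round: returns (R, new V) for inputs DT and V.

    ede is any callable mapping a 64-bit block to a 64-bit block; in the
    hardware this role is played by Triple-DES with the key pair K0, K1.
    """
    i = ede(dt)          # I = ede(DT)
    r = ede(i ^ v)       # R = ede(I xor V)
    new_v = ede(r ^ i)   # V = ede(R xor I)
    return r, new_v


def toy_ede(block):
    """Placeholder 64-bit permutation. NOT Triple-DES and NOT secure."""
    return ((block ^ 0x0123456789ABCDEF) * 0x9E3779B97F4A7C15) & MASK64
```

To model a 128-bit result, a caller would run `x917_round` twice, advancing the DT value with `lfsr64_step` and carrying the updated V from the first round into the second.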
D.3.5 Generation of Keys
The key pair, Key0 and Key1, is implemented using two 56-bit LFSRs, one for each key. The LFSR is based on the primitive polynomial $f(x) = x^{56} + x^{54} + x^{52} + x^{49} + 1$ (see Figure D-8). After reset, the Host must seed the LFSR using two write transfers. The seed for the LFSR can be any value except all zeroes.

[Figure D-8: Diagram of 56-bit LFSR to Generate Key Pair (Key0 and Key1). A chain of clocked registers with XOR gates at the tap positions.]

The keys could just as well be the output of a counter or any other circuitry that produces a unique number. For the Crypto Packet Processor, an LFSR is used because of the low gate count and improved timing.

D.3.6 Performance
The PRNG can produce a subsequent 64-bit pseudo random number every 150 system clock cycles, or a 128-bit pseudo random number every 300 system clock cycles. At a 150 MHz system clock, both cases correspond to 64 bits / 150 cycles × 150 MHz = 64 Mbits/sec. See Table D-3.

Table D-3. PRNG Performance
Word Rate | Number of Clock Cycles | Random Bit Rate at 150 MHz
64-bit pseudo random number | 150 | 64 Mbits/sec
128-bit pseudo random number | 300 | 64 Mbits/sec

Note: When the Crypto Packet Processor is using AES, the PRNG should generate 128-bit numbers; hence, bit [2] of the PRNG Control (PRNG_CTRL) register must be set to 1 in order to get sufficiently strong random data. For (3)DES, this bit can be set to 0, but in a typical case where both DES and AES are used, it is recommended that you set this bit to 1 and leave it at this value. Refer to “PRNG Control Register (PRNG_CTRL)” on page 475.

D.4 Input Token Definition
D.4.1 Introduction
The Crypto Packet Processor Inline Packet Engine processes packets using the instructions from an input token.
The token, read via a dedicated interface, consists of a header and a set of instructions (commands). Information in the token header initiates and controls the packet data and context fetching. The instructions that follow the token header control the packet processing itself, a process that is started when both context and packet data are available. Bypass data can be located at the end of the token after the processing instructions, where it will be passed to the result token without modification.
A token header must always contain a fixed set of four dwords. These header fields contain general information that is required for every packet, such as pointers to the packet data and context, and the packet length. The token has a minimum size of five dwords. Although the token length is unlimited, a token is typically between 10 and 30 dwords long.

D.4.2 Input Token Diagram
Bit positions 31 (left) to 00 (right); one row per token entry:
- - | U | IV | C | ToO | RC | (CT) | - | input packet length
input packet pointer
output packet pointer
transform record pointer / context pointer
context control 0 (optional)
context control 1 (optional)
IV0..1..2..3 (optional)
reserved | checksum (optional)
processing instructions
bypass token data (optional)
Note: All bits marked with dashes (-) are reserved and should be set to 0.

D.4.2.1 Input Token Header
The Input Token Header consists of the initial four (required) dwords plus the optional dwords that may include the context control words [1:0], the IV [3:0], and a 16-bit checksum.

Token Control Word (token dword [0], Required)
This section describes the fields of the Token Control Word, the first dword of the input token.

Table D-4. Token Control Word
Name | Description

Input Packet Length
This field must equal the number of packet bytes that need to be fetched and processed by the Crypto Packet Processor.

CT – Context Type (Reserved)
Only relevant when parallel operating engines are employed in combination with system level SA management; set to ‘00’ when only one engine is employed. For systems containing multiple instances of the engine, it is possible for one context to be used by several engines simultaneously. If in this case the use of the shared context also requires it to be updated after the packet processing operation, then the context record needs to be protected from use by other engines, in particular the part of the context that may be changed as a result of such an update.
Note: The Crypto Packet Processor does not support the Context Type field. Therefore the value of these (CT) bits is reserved and should be set to ‘00’.

RC – Reuse Context
The RC field is not used and must always be ‘00’. Note that this field is also referred to as Context Reuse.

ToO – Type of Output
These bits must be set to reflect the behavior of any post-processing instructions that can be present in the token. These bits control the moment at which the Crypto Packet Processor allows the packet data to be read from the output buffer: if one of the postprocessing instructions requires an update to the header of the packet (as would be the case for Authentication Header (AH) operations), then the packet must remain in the packet buffer until the complete packet has been processed. If the packet is larger than the Crypto Packet Processor internal packet buffer, then the update to the header is appended to the end of the packet data. In this case, the result token signals the type and amount of update data via the packet info fields, bits [31:22] of the second output (result) token word. Note that the packet length field in the result token does not include this additional update data.
If the token contains postprocessing instruction(s) requiring packet header updates, then one of the header update options must be set. One of the header update options may be used even if no header update instructions are present; this causes the Crypto Packet Processor to hold the packet in its internal buffer until it is completely processed. This allows the ToO bits to be set independently of the packet length, at the expense of a slight performance hit. Note that when the Crypto Packet Processor is instructed to perform the header update in its internal RAM, but the packet is too large to fit into the Crypto Packet Processor output buffer, the Crypto Packet Processor will still append the result to the packet.

ToO[1:0]
00: No header update. This setting is normally used if no header update instructions are present in the token. Note: If header update instructions are still present in the token, they are discarded by the Crypto Packet Processor and no header updates will take place.
01: Header update, small packets only. This setting causes the Crypto Packet Processor to hold the result packet data and attempt to update the packet header immediately. If the packet is too large to fit into the Crypto Packet Processor internal result packet buffer, the Crypto Packet Processor starts writing out the packet data after the result packet has exceeded the size of 1792 bytes, and reverts to appending the result data at the end of the packet data.
10: Append result to packet. This setting prevents the Crypto Packet Processor from updating the packet header in its internal packet buffer. Writing out of the packet data can start as soon as enough data is available. The update data is appended to the end of the packet.
11: Header update and result appending. This setting is required if the token contains instructions for both a header update and result data to be appended to the packet. The Crypto Packet Processor keeps the packet data in its internal packet buffer (under the same conditions as listed for option ‘01’) until the ‘STAT’ bits of one of the postprocessing instructions (see "Instruction Format") indicate that the last header update instruction has been processed. Update data from any postprocessing instruction following the ‘last header update’ postprocessing instruction is appended to the packet. Packet data output commences after the ‘last header update’ bits have been seen. There is currently no practical use case for this setting.

ToO[2]
Bit 2 of the Type of Output field is used to indicate pad removal options. This bit can only be set for inbound operations. If the bit is set, the padding type in the context record must be one of the following: PKCS#7, RTP, IPSec, TLS, or SSL.
0: no pad removal
1: remove and (optionally) verify pad
The ToO bits affect the operation of the INSERT_REMOVE_RESULT and REPLACE_BYTE instructions. Refer to “Post-Process Instructions” on page 408. Accidental use of this bit for outbound operations can result in unwanted pad removal or, if no padding is found, an error situation.

C: Context Control Words Present in Token
Setting this bit forces the Crypto Packet Processor to read the context control words (normally, the first two words of the context record) from the token (context control0 and context control1). The context control words in the token are formatted identically to the command words in the context record (see “Context Record Format” on page 433).

IV: Usage and Selection
Table D-1 shows typical examples of how the IV registers can be used for five types of crypto operations: DES-CBC, 3DES-CBC, AES-CTR, AES-ICM, and AES-CBC. Note that in the Crypto Packet Processor, IV fields (IV0 through IV3) can be taken from four possible sources:
1. Context Record
2. PRNG
3. Input Token
4. Input Packet (indirectly, via an Input Token processing instruction)

Table D-1. IV Register Usage
Columns: Encryption Algorithm (Mode) | IV0 | IV1 | IV2 | IV3 | Token Control [28:26] IV[2:0] | Context Control-1 [2:0] Crypto Mode [2:0] | Context Control-1 [8:5] IV3/IV2/IV1/IV0 | Context Control-1 [11:10] IV format
DES/3DES-CBC | Input packet | Input packet | not used | not used | 000 | 001 | 0000 | 00
DES/3DES-CBC | PRNG | PRNG | not used | not used | 001 | 001 | 0000 | 00
DES/3DES-CBC | Context record | Context record | not used | not used | 000 | 001 | 0011 | 00
AES-CTR | Input packet | Input packet | Input packet | Input packet | 000 | 110 | 0000 | 00
AES-CTR | Input packet | Input packet | Input packet | 32’h00000001 | 000 | 010 | 0000 | 00
AES-CTR | PRNG | PRNG | PRNG | PRNG | 001 | 110 | 0000 | 00
AES-CTR | Context record (nonce) | Input packet | Input packet | 32’h00000001 | 000 | 010 | 0001 | 01
AES-CTR | Context record (nonce) | PRNG | PRNG | 32’h00000001 | 001 | 010 | 0001 | 01
AES-CTR | Context record (nonce) | sequence number | sequence number | 32’h00000001 | 000 | 010 | 0001 | 10/11
AES-CTR | Context record (nonce) | Context record | Context record | Context record | 000 | 110 | 1111 | 00
AES-CTR | Context record (nonce) | Context record | Context record | 32’h00000001 | 000 | 010 | 0111 | 01
AES-CTR | Input token (nonce) | Input packet | Input packet | 32’h00000001 | 100 | 010 | 0000 | 01
AES-CTR | Input token (nonce) | PRNG | PRNG | 32’h00000001 | 101 | 010 | 0000 | 01
AES-CTR | Input token (nonce) | sequence number | sequence number | 32’h00000001 | 100 | 010 | 0000 | 10/11
AES-CTR | Input token (nonce) | Context record | Context record | Input token | 100 | 110 | 1110 | 00
AES-CTR | Input token (nonce) | Context record | Context record | 32’h00000001 | 100 | 010 | 0110 | 01
AES-ICM | Input packet | Input packet | Input packet | Input packet | 000 | 111 | 0000 | 00
AES-ICM | Input packet | Input packet | Input packet | Input packet with 16’h0000 | 000 | 011 | 0000 | 00
AES-ICM | PRNG | PRNG | PRNG | PRNG | 001 | 111 | 0000 | 00
AES-ICM | PRNG | PRNG | PRNG | PRNG with 16’h0000 | 001 | 011 | 0000 | 00
AES-ICM | Context record | Context record | Context record | Context record | 000 | 111 | 1111 | 00
AES-ICM | Context record | Context record | Context record | Context record with 16’h0000 | 000 | 011 | 1111 | 00
AES-CBC | Input packet | Input packet | Input packet | Input packet | 000 | 001 | 0000 | 00
AES-CBC | PRNG | PRNG | PRNG | PRNG | 001 | 001 | 0000 | 00
AES-CBC | Context record | Context record | Context record | Context record | 000 | 001 | 1111 | 00
Notes:
• The IV words loaded from the Input Packet are retrieved from the input data stream with a RETRIEVE instruction; refer to D.5.4.7 “RETRIEVE Instruction” on page 403 for more details. IVs sourced from the input packet are typically used for inbound (decrypt) type operations, where the IVs are sent along with the packet data (for example, ESP decapsulation). Internally generated IVs are typically used for outbound (encrypt) operations.
• AES-CTR is the underlying crypto algorithm for AES-GCM and AES-CCM. Therefore, the same IV loading mechanism applies.

Selecting IV Source
The Crypto Packet Processor engine’s internal IV register (128 bits) is initially written using the IV fields (IV0 through IV3) of the context record during a context fetch operation. Each IV field can then be modified (overwritten) before the internal IV register is actually used with encrypt or decrypt operations. The IV field modification options are described in this section.
In the Crypto Packet Processor, IV field modification options are selected by using the following bits:
1. Three IV bits in the Input Token Control Word, bits [28:26].
2. Two IV format bits of Context Control Word 1, bits [11:10].
The IV bits of the input token control word provide for the options identified in Table D-1 below.
Since the IV format options (bits [11:10] in Context Control Word 1) can modify the IV source further, please refer to "Summary of Possible IV Selection Result Scenarios" following this table. Note that in addition to the IV format field, the IV3 counter value can also be modified according to the Crypto Mode field setting (bits [2:0] of Context Control Word 1), by automatically initializing this counter value to 32’h00000001 or 16’h0000. Refer to “Context Control Word 1 Definition” on page 439.

Table D-1. Selecting IV Source Using Input Token Control Word
Input Token Control Word Bits [28:26] | IV Field Source Modification (After Initial IV Fetch from Context Record)
000 | No IV source modification is required. All IV fields reflect the context record values. No IV data is taken from the input token or the PRNG.
001 | Source all IV fields from the PRNG; if the PRNG is not ready, packet processing must be held.
010 | Source the IV3 field from the input token and keep the context record values for the other IV fields.
011 | Source the IV3 field from the input token and use the PRNG for the other IV fields.
100 | Source the IV0 and IV3 fields from the input token and keep the context values for the other IV fields.
101 | Source the IV0 and IV3 fields from the input token and use the PRNG for the other IV fields.
110 | Source the IV0 and IV1 fields from the input token and keep the context values for the other IV fields.
111 | Source all four IV fields from the input token.

Summary of Possible IV Selection Result Scenarios
The figures below show how the two IV format bits of Context Control Word 1, bits [11:10], are used with the three IV bits in the first dword of the Input Token, bits [28:26], described above, to further modify the IV source selection, yielding eight possible Result IV scenarios.
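The per-field source selection made by bits [28:26] of the Input Token Control Word can be expressed as a lookup table. The sketch below models only the table above; it deliberately ignores the further modifications made by the IV format bits [11:10] and the Crypto Mode field of Context Control Word 1. The function name and the source labels ("context", "prng", "token") are assumptions chosen for the example.

```python
# Source of (IV0, IV1, IV2, IV3) for each value of Token Control Word bits [28:26].
IV_SOURCES_BY_TOKEN_BITS = {
    0b000: ("context", "context", "context", "context"),
    0b001: ("prng", "prng", "prng", "prng"),
    0b010: ("context", "context", "context", "token"),
    0b011: ("prng", "prng", "prng", "token"),
    0b100: ("token", "context", "context", "token"),
    0b101: ("token", "prng", "prng", "token"),
    0b110: ("token", "token", "context", "context"),
    0b111: ("token", "token", "token", "token"),
}


def iv_sources(token_control_word):
    """Hypothetical helper: return the (IV0, IV1, IV2, IV3) sources for a token.

    Only bits [28:26] of the 32-bit Token Control Word are examined; all
    other bits are ignored.
    """
    iv_bits = (token_control_word >> 26) & 0x7
    return IV_SOURCES_BY_TOKEN_BITS[iv_bits]


def token_iv_word_count(token_control_word):
    """Number of IV fields that this selection takes from the input token."""
    return iv_sources(token_control_word).count("token")
```

A Host-side token builder could use `token_iv_word_count()` as a consistency check on how many IV fields a given selection expects the token to supply.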
[Figures D-9 through D-16: each figure shows the 128-bit IV register after context load (IV0/Nonce, IV1, IV2, IV3) and the resulting register contents for each IV format setting: ‘00’ Full IV mode, ‘01’ Counter mode, ‘10’ Original sequence number mode, and ‘11’ Incremented sequence number mode (outbound only; shown as Reserved in Figure D-14). In the sequence number modes, the figures show the sequence number words (seq. num. 1 and seq. num. 0) passing through a byte swap.]

Figure D-9: Result IV Using IV [2:0] = 3'b000 with IV Format [11:10]
Figure D-10: Result IV Using IV [2:0] = 3'b001 with IV Format [11:10]
Figure D-11: Result IV Using IV [2:0] = 3'b010 with IV Format [11:10]
Figure D-12: Result IV Using IV [2:0] = 3'b011 with IV Format [11:10]
Figure D-13: Result IV Using IV [2:0] = 3'b100 with IV Format [11:10]
Figure D-14: Result IV Using IV [2:0] = 3'b101 with IV Format [11:10]
Figure D-15: Result IV Using IV [2:0] = 3'b110 with IV Format [11:10]
Figure D-16: Result IV Using IV [2:0] = 3'b111 with IV Format [11:10]

U: Upper Layer Header from Token
If this bit is set, the checksum register in the Crypto Packet Processor context space is written with the checksum value supplied from the token. The checksum value in context space can then either be compared against the checksum in the packet header or be inserted in the packet header. See D.5.4.3 “INSERT Instruction” on page 398 and D.5.4.7 “RETRIEVE Instruction” on page 403.

Input Packet Pointer (token dword [1], Required)
Must be 0.

Output Packet Pointer (token dword [2], Required)
Must be 0.

Context Control Words [1:0] (token dwords [5:4], Optional)
Optionally, the Context Control Words [1:0] can be placed in the token at dword location [5:4], as specified by setting the C field in the first dword of a token header. Refer to “C: Context Control Words Present in Token” on page 381.

IV [3:0] (Optional)
Optionally, IV [3:0] can be placed in the input token at dword locations [9:6].

Processing Instructions (One Instruction Minimum)
Refer to D.5 Processing Instructions, which follows, for a detailed discussion and definition of all processing instructions supported by the Crypto Packet Processor.

Bypass Token Data (Optional)
This field contains data that must be bypassed from the input token to the result token and can have any length between 0 and 4 dwords. The first bypass data word must contain the bypass opcode.

D.5 Processing Instructions

D.5.1 Instruction Types

There are six types of processing instructions.
The first two types (Operational Data and IP Header Instructions) are executed by the Crypto Packet Processor preprocessor and can occur in any order. These preprocessor instructions create the different data streams for crypt, hash, and output.

The third type is executed on the result data stream by the Crypto Packet Processor post-process module. Type 3 instructions can be mixed with Type 1 and 2 instructions. Type 3 instructions can modify or append result data fields in the output packet.

Note: Type 1, 2, and 3 instructions are also referred to as execution instructions, since these are the only instructions “executed” by the pre- and post-processors.

Type 4 instructions are executed by the Crypto Packet Processor control module and can only occur after Type 1, 2, and 3 instructions. Type 5 instructions are also executed by the Crypto Packet Processor control module and can occur before or after the Type 1, 2, and 3 instructions.

Type 4 and 5 instructions are executed in the control module and do not affect the result data. Only context records and result tokens contain the results of these instructions.

The following subsections group the processing instructions under their appropriate instruction type. Also shown are the key fields used by each instruction.

D.5.1.1 Operational Data Instructions (Type 1)
The following operational data instructions are executed by the preprocessor:
1. DIRECTION
2. PRE_CHECKSUM
3. INSERT
4. INSERT_CTX
5. REPLACE
6. RETRIEVE
7. MUTE

D.5.1.2 IP Header Instructions (Type 2)
The following IP header instructions are executed by the preprocessor:
1. IPV4_CHECKSUM
2. IPV4
3. IPV6

D.5.1.3 Post-Process Instructions (Type 3)
After packet processing, the Crypto Packet Processor can perform the following instructions:
1. INSERT_REMOVE_RESULT
2. REPLACE_BYTE

D.5.1.4 Result Instructions (Type 4)
There is currently only one instruction of this type: VERIFY_FIELDS

D.5.1.5 Context Control Instructions (Type 5)
There is currently only one instruction of this type: CONTEXT_ACCESS

D.5.1.6 Special Instructions (Type 6)
There is currently only one instruction of this type: BYPASS

D.5.2 Instruction Sequencing

Token processing instructions are restricted to the sequencing shown in Table D-1.

Table D-1. Instruction Sequencing

Sequencing of Instructions (in the order shown)
  Context Control Instructions with the ‘result type’ field equal to ‘00’ (see Note 2)
  Execution Instructions (Instruction Types 1, 2, and 3) (see Note 1)
  Context Control Instructions with the ‘result type’ field equal to ‘00’ (see Note 2)
  Result Instructions (Instruction Type 4)
  Context Control Instructions (Instruction Type 5)

Note 1: Execution instructions are all instructions executed by the pre/post-processor.
Note 2: ‘Result type’ = ‘00’ refers to Context Control Instructions that always execute, and therefore have their F and P instruction fields set to ‘0’. (Refer to “Context Control Instructions” on page 425.)

D.5.2.1 Sequencing Rules
The following rules apply when using instructions:
• Context Control Instructions can never be mixed with the execution instructions (Types 1, 2, and 3).
• The Result Instructions must be located after the execution instructions.
• A token can contain only one Result Instruction.
• Context Control Instructions that occur after execution instructions and before the Result Instruction must have the result type field equal to ‘00’ (execute always). Refer to “Context Control Instructions” on page 425.

D.5.3 Instruction Format

The general format for all token processing instructions is described in this section.
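As a rough sketch of the general 32-bit layout that this section defines (opcode in bits [31:28], instruction dependent fields in bits [27:19], STAT in bits [18:17], and length/offset in bits [16:0]), the helpers below pack and unpack one instruction word. The function names are hypothetical and exist only for illustration; they are not taken from any Tilera library.

```python
# Illustrative only: hypothetical helpers, not a Tilera API.
# Field widths follow the general instruction format of this section.
FIELDS = (("opcode", 28, 4), ("dependent", 19, 9), ("stat", 17, 2), ("length", 0, 17))


def pack_instruction(opcode, dependent, stat, length):
    """Pack the four generic fields into one 32-bit instruction word."""
    values = {"opcode": opcode, "dependent": dependent, "stat": stat, "length": length}
    word = 0
    for name, shift, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << shift
    return word


def unpack_instruction(word):
    """Split a 32-bit instruction word back into its generic fields."""
    return {name: (word >> shift) & ((1 << width) - 1) for name, shift, width in FIELDS}
```

For instance, opcode `0000` with instruction dependent bits `0_111_00000`, STAT `11`, and a length of 64 bytes packs to `0x07060040`, and `unpack_instruction(0x07060040)` recovers the same four values.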
Bits    31:28    27:19                          18:17   16:00
Field   opcode   instruction dependent fields   STAT    length / offset
(followed by optional 32-bit data words)

Table D-2. Instruction Format

Bits    Name / Description
31:28   opcode
        Each instruction has a unique operation code (opcode). Table D-3 provides a complete list of instruction operation codes. The opcodes for reserved instructions cannot be used.
27:19   Instruction dependent fields
        The definition of fields within bits [27:19] is dependent on the instruction used. Therefore, for details on how these bits are used, refer to the specific instruction in D.5 Processing Instructions.
18:17   STAT
        STAT Definition for Type 1 and 2 Instructions.
        The STAT field is used to make sure cryptographic and authentication operations are completed. As long as the status bits are 0, all data streams expect to receive more data after execution of this instruction. If the hash status bit (bit 17) is set, the current instruction passes the last hash data. If the last data bit (bit 18) is set, the current instruction has operated on the last data bit from the datapath through the packet-processing engine.
        Encoding of the STAT field for Type 1 and 2 instructions:
          00  processing
          01  last hash data for hash engine
          10  last data for packet processing
          11  last hash data for hash engine and last data for packet processing
        STAT Definition for Type 3 Instructions.
        For the Insert result (opcode: 1011) and replace (opcode: 1100) instructions, these bits are used differently. If the checksum status bit (bit 17) is set, the modification does not modify the checksum calculated by the postprocessing. If the last insert bit (bit 18) is set, this instruction is the last instruction that inserts data into the data stream. After execution of this instruction, it is not required for this packet to hold data in the packet buffer.
        Encoding of the STAT field for postprocessing instructions:
          00  checksum modification and not last INSERT instruction
          01  no checksum modification required
          10  last insert header instruction; remaining post-process instructions are append data
          11  no checksum modification required and last ‘insert header’ instruction
        The use of the last insert bit (for postprocessing instructions) is optional; it can improve performance.
16:00   length/offset
        Length or offset pointer.
        length: If a length field applies, it indicates the number of bytes that are to be processed by the instruction. For Type 1 instructions, the length field must be greater than 0.
        offset: If an offset field applies, it is the offset into the output stream. All data bytes are written into the data stream starting at this offset position.
31:00   data (Optional)
        The data appended to an instruction must be specified in little-endian format. All data words to be inserted are concatenated after the instruction. The length field indicates the number of bytes that need to be inserted into the data stream, which equals the data appended after the instruction in the token.

Table D-3. Instruction Operation Codes

Operation Code   Instruction                      Instruction Type
0000             DIRECTION                        Operational Data Instruction (Type 1)
0001             PRE_CHECKSUM                     Operational Data Instruction (Type 1)
0010             INSERT                           Operational Data Instruction (Type 1)
1001             INSERT_CTX                       Operational Data Instruction (Type 1)
0011             REPLACE                          Operational Data Instruction (Type 1)
0100             RETRIEVE                         Operational Data Instruction (Type 1)
0101             MUTE                             Operational Data Instruction (Type 1)
0111             IPV4 (with checksum untouched)   IP Header Instruction (Type 2)
0110             IPV4_CHECKSUM                    IP Header Instruction (Type 2)
1000             IPV6                             IP Header Instruction (Type 2)
1010             INSERT_REMOVE_RESULT             Post-processing Instruction (Type 3)
1011             REPLACE_BYTE                     Post-processing Instruction (Type 3)
1100             reserved                         n/a
1101             VERIFY                           Result Instruction (Type 4)
1110             CONTEXT_ACCESS                   Context Control Instruction (Type 5)
1111             BYPASS_TOKEN_DATA                Special Instruction (Type 6)

D.5.4 Operational Data Instructions

D.5.4.1 Direction Instruction

This instruction does not modify any data; it only passes a number of bytes to the crypto engine, hash engine, output buffer, or combinations of these. The length field (in bytes) indicates the amount of packet data to be transferred by this instruction. The t.o.dest. field determines the destination(s). The L bit can be used to indicate the last block for the crypto or hash engines; this bit can optionally be used to improve performance in the case of a block cipher and must be used in the case of stream ciphers and GCM (GHASH).

Note: Token instructions sending data to the crypto engine must follow each other directly; that is, the input to the crypto engine must be provided as a single, uninterrupted block of data. No instruction that sends data to destinations other than the crypto engine may appear in between. The crypto engine cannot handle any data alignment issues that may occur if intermediate data handling instructions occur between two instructions targeting the crypto engine.

DIRECTION
Bits    31:28    27   26:24       23:19      18:17   16:00
Field   opcode   L    t.o.dest.   reserved   STAT    length
Value   0000     -    ---         -----      --      (variable)

Table D-4. DIRECTION Definition

Bits    Name / Description
31:28   opcode
        Refer to "opcode" in Table D-2 for a description of this field.
27      L (Last)
        When the ‘L’ bit is set, and the crypto data bit (bit 26) in the ‘t.o.dest.’ field is also set, then this is the last crypto data block. In the case of CTR or ICM mode, the ‘L’ bit must be set for the last block; for all other modes this bit is optional.
        When the ‘L’ bit is set, and the hash bit (bit 25) in the ‘t.o.dest.’ field is also set, while the crypto bit (bit 26) is NOT set, then this is the last hash data before crypto data. In this case, the ‘L’ bit is only required for a GHASH (GCM) operation, where the AAD data needs to be hashed in a separate operation before the crypto data is hashed.
26:24   t.o.dest.
        Type of destination. See Table D-5 below.
23:19   reserved
        Reserved.
18:17   STAT
        Refer to the STAT field in Table D-2 for a description of this field.
16:00   length
        Refer to "length" on page 395 for a description of this field.

Table D-5. Type of Destination Field Description

Value   Type of Destination                     Use Example
000     reserved                                n/a
001     output only (for bypass data)           IP header
010     hash engine only                        extended sequence number
011     hash engine and output                  protocol header
100     crypto engine only                      n/a
101     crypto engine and output                encrypt only payload
110     crypto engine and hash engine           n/a
111     crypto engine, hash engine and output   ESP payload

D.5.4.2 PRE_CHECKSUM Instruction

The PRE_CHECKSUM instruction is used to update 16-bit checksums during preprocessing.
This instruction is typically used in performing NAT operations where checksum updating is required to reflect updated IP addresses or port numbers.

PRE_CHECKSUM
Bits    31:28    27   26:24      23:19    18:17   16:00
Field   opcode   L    t.o.cmd.   origin   STAT    checksum update value
Value   0001     -    111        01010    --      (variable)

Table D-1. PRE_CHECKSUM Instruction Definition

Bits    Name / Description
31:28   opcode
        Refer to "opcode" for a description of this field.
27      L (Last)
        The description for the ‘L’ field is the same as in “Direction Instruction” on page 396. Note that when PRE_CHECKSUM is used with the IP protocol, L is always set to ‘0’, since the checksum is never last.
26:24   t.o.cmd. (type of command)
        This field must be set to ‘111’.
23:19   origin
        The origin field must be set to ‘01010’.
18:17   STAT
        Refer to Table D-2 for a description of this field.
16:00   checksum update value
        This value must be the difference between the old checksum value and the new desired checksum value. The update is applied as follows:
        1. The original value in the data stream is XORed with 0xFFFF (inverted).
        2. This result is added, using a 16-bit ‘ADD with carry’, to the ‘checksum update value’ field of the PRE_CHECKSUM instruction.
        3. The result of this calculation is XORed with 0xFFFF and replaces the original value in the data stream.

D.5.4.3 INSERT Instruction

The INSERT instruction is used to insert data into the data stream before the data enters the packet processing module. The INSERT instruction inserts data either from the internal register bank (such as the context registers) or directly from the token, following the instruction (the latter is also referred to as an INSERT immediate). The data source to be inserted is selected by the origin field.
Note that in Table D-3, some of the internal registers are ordered to allow multiple fields to be inserted with one instruction. For example, the SPI, sequence number result, and IV0 through IV3 registers are typically inserted with one instruction.

In addition to inserting data into the data stream, the INSERT instruction can also insert one of several types of padding: zero padding, PKCS#7, Constant, RTP, IPSec, TLS, SSL, or TFC. Padding is applied if the two most significant bits of the origin field are set to ‘00’.

INSERT
Bits    31:28    27   26:24       23:19            18:17   16:09                            08:00
Field   opcode   L    t.o.dest.   padding/origin   STAT    extended length / insert value   length
Value   0010     -    ---         -----            --      --------                         ---------

Table D-2. INSERT Instruction Definition

Bits    Name / Description
31:28   opcode
        Refer to "opcode" for a description of this field.
27      L (Last)
        The description for the ‘L’ field is the same as in “Direction Instruction” on page 396.
26:24   t.o.dest. (type of destination)
        The values for the ‘t.o.dest.’ field are the same as those listed in “Direction Instruction” on page 396.
23:19   padding/origin
        The INSERT instruction uses the padding/origin field for one of two possible purposes, padding or origin.
        1. Padding Use: MSbs = ‘00’
        If the two MSbs are ‘00’, then the three LSBs indicate the type of padding to be used. For Constant, RTP, SSL, and IPSec padding, bits [16:9] contain the value that needs to be inserted. In the case of Constant, RTP, and SSL padding, the insert value represents the constant padding value. The location of the inserted value is indicated by the value ‘99’ in the examples below. For TFC, the full 17 bits are used to indicate the length.
        Possible padding values for this field are listed in the following table:

        Value   Padding Type         Padding Sequence
        00000   zero padding         00-00-00-00-00-00
                                     Note: This padding type can also be used to insert a number of ‘0’ bytes (1 to 511) into the data stream.
        00001   PKCS#7               06-06-06-06-06-06
        00010   Constant             99-99-99-99-99-99
        00011   RTP                  99-99-99-99-99-06
        00100   IPSec (see Note 1)   01-02-03-04-04-99 (in this case, ‘99’ should contain the NH value from the original datagram)
        00101   TLS                  05-05-05-05-05-05
        00110   SSL                  99-99-99-99-99-05
        00111   TFC                  00-00-…-00-00-…00-00
                                     The ‘extended length’ field applies to this padding selection.

        Note 1: To create a packet that uses IPSec padding filled with zeros (00-00-00-00-04-NH), two token instructions should be used:
        a) INSERT the SSL pad (length is 4+1=5 bytes), to insert the sequence (00-00-00-00-04).
        b) INSERT 1 byte from the token, where the 1 byte is the NH value.

        2. Origin Use: MSbs = not ‘00’
        Possible origin values for this field are listed in Table D-3.
18:17   STAT
        Refer to the definition for the STAT field in Table D-2.
16:09   extended length / insert value
        Depending on the value of the padding/origin field, this field is interpreted as either the ‘extended length’ (the upper 8 bits of the length value) or the ‘insert value’ (the value to be used for certain padding modes, for example Constant, RTP, and SSL).
8:00    length
        Refer to "length" on page 395 for a description of this field. Note that when padding is inserted, this field indicates the total number of padding bytes.
        Example: To add 14 IPSec padding bytes, the length field, and the next header, specify a length value of 16 (0x10 in hexadecimal). To add 20 bytes of SSL or TLS padding and the length field, specify a length value of 21 (0x15 in hexadecimal).

Table D-3. ‘Origin’ Field Encoding for Preprocessing Instructions

‘origin’ Value   INSERT Instruction: Internal Register(s) to be Inserted                                      R/W Status
01000            seq_num result (copy of 10011)                                                               R/W
                 For outbound, this is the incremented result of the sequence number from context.
01001            extended sequence number result                                                              R/W
                 For inbound, this is an estimation based on the sequence number from context and the retrieved sequence number data. For outbound, this is the incremented result of the 64-bit sequence number from context.
                 Note: This origin value should be used for inserting the IPSec extended sequence number into the data stream for use by the hash engine.
01010            seq_num (from context)                                                                       R
01011            extended sequence number (from context)                                                      R
01100            seq_num (from context) (copy of 01010)                                                       R
01101 - 01111    reserved                                                                                     -
10000            General purpose register 0                                                                   R/W
10001            General purpose register 1                                                                   R/W
10010            SPI                                                                                          W: SPI result; R: SPI active
                 If length is 8, SPI and seq_num result are inserted. If length is 16 or 24, SPI, seq_num result, and IV are inserted.
10011            seq_num result                                                                               R/W
10100            IV0 (first IV word)                                                                          R/W
10101            IV1 (second IV word)                                                                         R/W
10110            IV2 (third IV word)                                                                          R/W
10111            IV3 (fourth IV word)                                                                         R/W
11000            SPI result register                                                                          R
11001            checksum: 16-bit checksum value optionally located in the 16 LSBs of the input token         R/W
11010            checksum calculation store                                                                   R/W
11011            Indicates that this is an ‘INSERT (immediate)’ instruction. See INSERT (immediate), Table D-4, under “INSERT Instruction” on page 398.   -
11100            hash result digest from hash engine                                                          R/W
                 Note: Can be any length up to 16 dwords (512 bits).
11101 - 11111    reserved                                                                                     -

If the length exceeds the fields of one origin type, the instruction continues with the next field, which is located at the next origin location. Be aware that after the fourth hash word, the fifth up to the 16th hash word are read.
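To make the INSERT field layout and the two uses of the padding/origin field concrete, here is a minimal encoder sketch. The function and constant names are hypothetical (they do not come from any Tilera software), and the destination values chosen in the examples are arbitrary illustrations, not recommended settings.

```python
# Illustrative only: hypothetical names, not a Tilera API.
INSERT_OPCODE = 0b0010
ORIGIN_SPI = 0b10010        # 'origin' use: SPI register
ORIGIN_IMMEDIATE = 0b11011  # 'origin' use: INSERT (immediate)
PAD_IPSEC = 0b00100         # 'padding' use: IPSec padding


def is_padding(origin):
    """Padding use applies when the two MSbs of padding/origin are '00'."""
    return (origin >> 3) == 0


def insert_word(dest, origin, length, *, last=False, stat=0, ext_or_value=0):
    """Build an INSERT instruction word.

    ext_or_value fills bits [16:9]: the insert value for Constant, RTP, SSL,
    and IPSec padding, or the upper 8 length bits for TFC padding.
    """
    for name, value, limit in (("dest", dest, 0b111), ("origin", origin, 0b11111),
                               ("stat", stat, 0b11), ("length", length, 0x1FF),
                               ("ext_or_value", ext_or_value, 0xFF)):
        if not 0 <= value <= limit:
            raise ValueError(f"{name} out of range")
    return ((INSERT_OPCODE << 28) | (int(last) << 27) | (dest << 24) |
            (origin << 19) | (stat << 17) | (ext_or_value << 9) | length)
```

With these assumptions, `insert_word(0b011, ORIGIN_SPI, 8)` yields `0x23900008` (insert SPI and seq_num result toward the hash engine and output), and `insert_word(0b111, PAD_IPSEC, 16, ext_or_value=0x04)` yields `0x27200810` (16 bytes of IPSec padding with an NH value of 0x04).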
INSERT (immediate)

The INSERT (immediate) instruction inserts data immediately following the instruction into the data stream. It is identified by the value 5’b11011 in the origin field.

Bits    31:28    27   26:24       23:19    18:17   16:09      08:00
Field   opcode   L    t.o.dest.   origin   STAT    length     length
Value   0010     -    ---         11011    --      00000000   ---------
(followed by the data words)

Table D-4. INSERT (immediate) Definition

Bits    Name / Description
31:28   opcode
        Refer to "opcode" for a description of this field.
27      L (Last)
        The description for the ‘L’ field is the same as in “Direction Instruction” on page 396.
26:24   t.o.dest. (type of destination)
        The values for the ‘t.o.dest.’ field are the same as those listed in “Direction Instruction” on page 396.
23:19   origin
        The value of 5’b11011 indicates that this is an INSERT (immediate) instruction.
18:17   STAT
        Refer to Table D-2, “Instruction Format,” on page 394 for a description of this field.
16:00   length
        Refer to "length" on page 395 for a description of this field.
31:00   data
        Refer to “Processing Instructions” on page 392 for a description of this field.

D.5.4.4 INSERT Instruction Example – NOP

The following INSERT instruction can be used as a NOP or dummy instruction.

NOP
Bits    31:28    27   26:24       23:19            18:17   16:00
Field   opcode   L    t.o.dest.   padding/origin   STAT    reserved
Value   0010     0    000         00000            --      0 0000 0000 0000 0100

This instruction does not process any data and takes exactly one clock cycle to process. Note that the sequencing rules must still be respected and that, since t.o.dest. is 0 (the crypto bit is not set), the cryptographic data stream may not be interrupted with this NOP instruction.

Table D-5. NOP Definition

Bits    Name / Description
31:28   opcode
        Refer to "opcode" for a description of this field.
27      L (Last)
        The description for the ‘L’ field is the same as in “Direction Instruction” on page 396.
26:24   t.o.dest. (type of destination)
        The values for the ‘t.o.dest.’ field are the same as those listed in “Direction Instruction” on page 396.
23:19   padding/origin
        Set to ‘00000’, as shown in the diagram above.
18:17   STAT
        Refer to Table D-2, “Instruction Format,” on page 394 for a description of this field.
16:00   reserved
        Reserved.

D.5.4.5 INSERT_CTX Instruction

This instruction is intended to insert:
1. Data from the token to the output stream, and at the same time, write these data to the context record.
2. Data from the context record to the output stream, and at the same time, write these data to another location in the context record.

INSERT_CTX
Bits    31:28    27   26:24       23:19    18:17   16:12           11:09      08:00
Field   opcode   L    t.o.dest.   origin   STAT    context dest.   reserved   length
Value   1001     -    ---         -----    --      -----           000        ---------

Table D-1. INSERT_CTX Definition

Bits    Name / Description
31:28   opcode
        Refer to "opcode" for a description of this field.
27      L (Last)
        The description for the ‘L’ field is the same as in “Direction Instruction” on page 396.
26:24   t.o.dest.
        Destination in the datapath.
23:19   origin
        Indicates the origin of the data.
18:17   STAT
        Refer to "STAT" for a description of this field.
16:12   context_dest
        Indicates the destination in the context record. When this field is 0, no write to the context record is performed.
11:9    reserved
        Reserved.
8:00    length
        Refer to "length" on page 395 for a description of this field.
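The INSERT_CTX field layout in the table above can be sketched as a small decoder, which may be handy when inspecting tokens during debugging. The `InsertCtx` type and the function names are hypothetical and serve only as an illustration; the example word is made up.

```python
# Illustrative only: hypothetical names, not a Tilera API.
from typing import NamedTuple

INSERT_CTX_OPCODE = 0b1001


class InsertCtx(NamedTuple):
    last: bool         # bit 27
    dest: int          # bits [26:24], t.o.dest.
    origin: int        # bits [23:19]
    stat: int          # bits [18:17]
    context_dest: int  # bits [16:12]
    length: int        # bits [8:0]


def decode_insert_ctx(word):
    """Split an INSERT_CTX instruction word into its fields."""
    if (word >> 28) & 0xF != INSERT_CTX_OPCODE:
        raise ValueError("not an INSERT_CTX instruction")
    return InsertCtx(bool((word >> 27) & 1), (word >> 24) & 0b111,
                     (word >> 19) & 0b11111, (word >> 17) & 0b11,
                     (word >> 12) & 0b11111, word & 0x1FF)


def writes_context(fields):
    """A context_dest of 0 means no write to the context record."""
    return fields.context_dest != 0
```

Decoding the made-up word `0x91A03004` gives t.o.dest. `001`, origin `10100`, context_dest `00011`, and a length of 4, so `writes_context` reports that the context record is written.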
Note: When using the INSERT_CTX instruction to read fields from the context that were updated by a previous instruction, another instruction must be used before the INSERT_CTX instruction. This can be any instruction or simply the NOP instruction. This ensures that the Crypto Packet Processor has time to commit the updated context values before they are read.

D.5.4.6 REPLACE Instruction

The REPLACE instruction operates similarly to the INSERT instruction, except that it overwrites the input data in the data stream instead of inserting new data. The REPLACE instruction is used to overwrite data in the data stream before the data enters the packet processing module.

REPLACE
Bits    31:28    27   26:24       23:19    18:17   16:09      08:00
Field   opcode   L    t.o.dest.   origin   STAT    length     length
Value   0011     -    ---         -----    --      00000000   ---------

Table D-2. REPLACE Definition

Bits    Name / Description
31:28   opcode
        Refer to "opcode" for a description of this field.
27      L (Last)
        The description for the ‘L’ field is the same as in “Direction Instruction” on page 396.
26:24   t.o.dest. (type of destination)
        The values for the ‘t.o.dest.’ field are the same as those listed in “Direction Instruction” on page 396.
23:19   origin
        Refer to the origin field description in Table D-3. Origin values 00000-00111, 01100, 01111, and 11101-11111 are not applicable for the REPLACE instruction.
        Two additional origins are available for this instruction only:
        01101   Increment current value: one byte from the input data stream is incremented by 1. The length field must be set to 1 (17’h00001) when using this origin.
        01110   Decrement current value: one byte from the input data stream is decremented by 1. The length field must be set to 1 (17’h00001) when using this origin.
18:17   STAT
        Refer to Table D-2, “Instruction Format,” on page 394 for a description of this field.
16:00   length
        Refer to "length" on page 395 for a description of this field.

D.5.4.7 RETRIEVE Instruction

The purpose of the RETRIEVE instruction is to retrieve length bytes of data from the input data stream, starting from the point where the previous instruction stopped processing, and send this data to a different location or simply remove it from the data stream. Note that this instruction differs from the post-process instructions, such as INSERT_REMOVE_RESULT, which modify the output data stream (see “Post-Process Instructions” on page 408).

RETRIEVE
Bits    31:28    27   26:24       23:19    18:17   16:09      08:00
Field   opcode   L    t.o.dest.   origin   STAT    length     length
Value   0100     -    ---         -----    --      00000000   ---------

The following subsections show three examples of the RETRIEVE instruction.

Table D-3. RETRIEVE Definition

Bits    Name / Description
31:28   opcode
        Refer to "opcode" for a description of this field.
27      L (Last)
        The description for the ‘L’ field is the same as in “Direction Instruction” on page 396.
26:24   t.o.dest. (type of destination)
        The values for the ‘t.o.dest.’ field are the same as those listed in “Direction Instruction” on page 396.
23:19   origin
        Refer to the origin field description in Table D-3. Origin values 00000-00111, 01100, 01111, and 11101-11111 are not applicable for the RETRIEVE instruction.
18:17   STAT
        Refer to Table D-2, “Instruction Format,” on page 394 for a description of this field.
16:00   length
        Refer to "length" on page 395 for a description of this field.
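The RETRIEVE layout above can be sketched as a small encoder. The three variants described in the examples that follow differ only in the t.o.dest. and origin fields, which the helper functions below make explicit. All names are hypothetical, chosen for illustration rather than taken from Tilera software.

```python
# Illustrative only: hypothetical names, not a Tilera API.
RETRIEVE_OPCODE = 0b0100
ORIGIN_REMOVE_ONLY = 0b11011  # origin used by the remove-only variant


def retrieve_word(length, origin, dest=0, *, last=False, stat=0):
    """Build a RETRIEVE instruction word (Type 1: length must be > 0)."""
    if not 0 < length <= 0x1FFFF:
        raise ValueError("length must be 1..0x1FFFF")
    if not (0 <= origin <= 0b11111 and 0 <= dest <= 0b111 and 0 <= stat <= 0b11):
        raise ValueError("field out of range")
    return ((RETRIEVE_OPCODE << 28) | (int(last) << 27) | (dest << 24) |
            (origin << 19) | (stat << 17) | length)


def copy_store_pass(length, origin, dest):
    """Store in the register selected by origin and pass data on to dest."""
    if dest == 0:
        raise ValueError("copy, store, and pass needs a destination")
    return retrieve_word(length, origin, dest)


def remove_and_store(length, origin):
    """Store in the register selected by origin; L and t.o.dest. are zero."""
    return retrieve_word(length, origin, dest=0)


def remove_only(length):
    """Drop the data: L and t.o.dest. are zero and origin is 11011."""
    return retrieve_word(length, ORIGIN_REMOVE_ONLY, dest=0)
```

Under these assumptions, `remove_only(8)` gives `0x40D80008`, `remove_and_store(16, 0b10100)` gives `0x40A00010` (store 16 bytes starting at IV0), and `copy_store_pass(4, 0b10010, 0b011)` gives `0x43900004`.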
RETRIEVE (Copy, Store, and Pass) Instruction Example
This RETRIEVE instruction copies data from the input data stream, stores the data in context registers (the destination register is selected by the origin field), and then passes the same data to the processing engine (under control of the ‘type of destination’ (t.o.dest.) field).
Instruction word, bits 31 to 00 (left to right): opcode = 0100 | L | t.o.dest. | origin | STAT | length

RETRIEVE (Remove and Store) Instruction Example
This RETRIEVE instruction removes data from the input data stream and only stores the data in the context registers. Note that the L and t.o.dest. fields are set to all zeroes. The origin field indicates which context registers are overwritten.
Instruction word, bits 31 to 00 (left to right): opcode = 0100 | L = 0 | t.o.dest. = 000 | origin | STAT | length

RETRIEVE (Remove Only) Instruction Example
This RETRIEVE instruction simply removes data from the input data stream and neither stores the data nor passes it to the processing engine. The L and t.o.dest. fields are set to zeros and the origin field is set to 11011.
Instruction word, bits 31 to 00 (left to right): opcode = 0100 | L = 0 | t.o.dest. = 000 | origin = 11011 | STAT | length

D.5.4.8 MUTE Instruction
The MUTE instruction performs a bitwise AND of the input data from the packet with the mask data immediately following the MUTE instruction. The MUTE instruction typically sends data to two destinations at the same time. The first destination, specified by the t.o.dest. field, receives the muted version of the data, and the second destination, specified by the t.o.dest.2 field, receives the original, unmuted data. Typically the t.o. dest.
destination is set to ‘010’ (hash engine) and the t.o.dest.2 destination is set to ‘001’ (the output data stream).

The length field indicates the number of bytes to be muted, up to 508 bytes. The specified length must be a multiple of four bytes (bits [1:0] = ‘00’).

MUTE
Instruction word, bits 31 to 00 (left to right): opcode = 0101 | r | t.o.dest. | r | t.o.dest.2 | M | STAT | length
The instruction word is followed by the mask words mask0 … mask(n-1).

Table D-4. MUTE Definition
Bits  Name  Description
31:28  opcode  Refer to "opcode" for a description of this field.
27  r  Reserved, set to ‘0’.
26:24  t.o.dest.2 (type of destination 2)  The values for the ‘t.o.dest.’ field are the same as those listed in “Direction Instruction” on page 396, except that the crypto engine is not allowed as a destination (bit [22] must be set to ‘0’). Typically this field is set to ‘001’ (output). This destination receives the original, unmuted data.
23  r  Reserved, set to ‘0’.
22:20  t.o.dest. (type of destination)  The values for the ‘t.o.dest.’ field are the same as those listed in “Direction Instruction” on page 396, except that the crypto engine is not allowed as a destination (bit [26] must be set to ‘0’). Typically this field is set to ‘010’ (hash engine). This destination receives the muted version of the data.
19  M  Indicates that the mask is appended to the instruction. If M is set to ‘0’, no mask fields are supplied and the instruction assumes that ‘length’ bytes of zeros are to be used as mask values. M = ‘0’ could be used with IPv6 extension headers that contain complete 32-bit words to be muted.
18:17  STAT  Refer to Table D-2, “Instruction Format,” on page 394 for a description of this field.
16:00  length  The length field (see "length" on page 395) indicates the number of bytes to be muted, up to 508 bytes. The specified length must be a multiple of four bytes (bits [1:0] = ‘00’).
31:00  mask  The mask data immediately following the MUTE instruction, used in the bitwise AND operation with the input packet data. Refer to “data (Optional)” on page 395.

Note: If the same value is set for both destination fields, the MUTE instruction sends both the muted and the unmuted data to the same destination.

D.5.5 IP Header Instructions
There are three IP header instructions:
• IPv4 (with checksum cleared)
• IPv4_CHECKSUM (checksum updated)
• IPv6

All three instructions involve only the first words of the IP header. In the case of IPv4, the first three words are processed by the IP header instruction, while in the case of IPv6 only the first two words are processed.

D.5.5.1 IPv4 Instruction
The IPv4 instruction processes the first three words of the IPv4 header and passes them to the destination specified in the t.o.dest. field. It inserts the length and protocol fields of the instruction into the corresponding fields in the IP header. It also clears the IPv4 checksum to 0 and, if the D field of the instruction is set to ‘1’, decrements the IPv4 ‘time to live’ field. Figure D-17 illustrates which IPv4 header fields are inserted, updated, or untouched.

IPv4 (With Checksum Untouched)
Instruction word, bits 31 to 00 (left to right): opcode = 0111 | D | t.o.dest. | protocol | length

Table D-5. IPv4 (with Checksum Untouched) Definition
Bits  Name  Description
31:28  opcode  Refer to "opcode" for a description of this field.
27  D (‘time to live’ decrement)  The IPv4 ‘time to live’ value is decremented when ‘D’ is set to ‘1’.
26:24  t.o.dest.  Destination for the processed header words.
The values for the ‘t.o.dest.’ field are the same as those listed in “Direction Instruction” on page 396.
23:16  protocol  The value to be placed in the ‘protocol’ field of the IPv4 header.
15:00  length  The value (in little-endian format) to be placed in the ‘length’ field of the IPv4 header. The IPv4 instruction takes care of the byte swapping.

[Figure: IPv4 datagram header, 32 bits wide, showing the version, IHL, type of service, length, identifier, flags, fragment offset, time to live, protocol, and checksum fields, with the modification labels “replaced with the value from the instruction” and “updated”.]
Figure D-17: IPv4 (with Checksum Untouched) Instruction: Datagram Modifications
Refer to Table D-3.

D.5.5.2 IPv4_CHECKSUM Instruction
The IPv4_CHECKSUM instruction processes the first three words of the IP header and passes them to the destination specified in the t.o.dest. field. It inserts the length and protocol fields of the instruction into the corresponding fields in the IP header. It also updates the IPv4 checksum and, if the D field of the instruction is set to ‘1’, decrements the IPv4 ‘time to live’ field. Figure D-18 shows which IPv4 header fields are inserted or updated.

IPv4_CHECKSUM
Instruction word, bits 31 to 00 (left to right): opcode = 0110 | D | t.o.dest. | protocol | length

Table D-6. IPv4_CHECKSUM Definition
Bits  Name  Description
31:28  opcode  Refer to "opcode" for a description of this field.
27  D (‘time to live’ decrement)  The IPv4 ‘time to live’ value is decremented when ‘D’ is set to ‘1’.
26:24  t.o.dest.  Destination for the processed header words. The values for the ‘t.o.dest.’ field are the same as those listed in “Direction Instruction” on page 396.
23:16  protocol  The value to be placed in the ‘protocol’ field of the IPv4 header.
15:00  length  The value (in little-endian format) to be placed in the ‘length’ field of the IPv4 header. The IPv4_CHECKSUM instruction handles the byte swapping.

[Figure: IPv4 datagram header, 32 bits wide, showing the version, IHL, type of service, length, identifier, flags, fragment offset, time to live, protocol, and checksum fields, with the modification labels “replaced with the value from the instruction” and “updated”.]
Figure D-18: IPv4_CHECKSUM Instruction: Datagram Modifications

D.5.5.3 IPv6 Instruction
The IPv6 instruction processes the first two words of the IPv6 header and passes them to the destination specified by the t.o.dest. field. It inserts the instruction’s payload length and next header fields in the corresponding fields of the IPv6 header. The D field determines whether the IPv6 ‘hop limit’ value is decremented. Figure D-19 illustrates which fields of the IPv6 header are removed, inserted, or updated.

IPv6
Instruction word, bits 31 to 00 (left to right): opcode = 1000 | D | t.o.dest. | next header | payload length

Table D-7. IPv6 Definition
Bits  Name  Description
31:28  opcode  Refer to "opcode" for a description of this field.
27  D (‘hop limit’ decrement)  The IPv6 ‘hop limit’ value is decremented when ‘D’ is set to ‘1’.
26:24  t.o.dest.  Destination for the processed header words. The values for the ‘t.o.dest.’ field are the same as those listed in “Direction Instruction” on page 396.
23:16  next header  The value to be placed in the ‘next header’ field of the IPv6 header.
15:00  payload length  The value (in little-endian format) to be placed in the ‘payload length’ field of the IPv6 header. The IPv6 instruction takes care of the byte swapping.
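The three IP header instructions share one field layout and differ only in opcode, so their encoding can be sketched with a single helper. The Python below is a minimal sketch under stated assumptions: the function name encode_ip_header and its parameters are hypothetical, the t.o.dest. value ‘001’ (output data stream) is taken from the MUTE description rather than from the full list in “Direction Instruction”, and the length is passed as a plain integer on the assumption that the instruction itself performs the byte swapping as the tables state.

```python
# Hypothetical helper; names are illustrative, not part of any Tilera API.
OPCODES = {
    "IPv4": 0b0111,           # checksum cleared
    "IPv4_CHECKSUM": 0b0110,  # checksum updated
    "IPv6": 0b1000,
}


def encode_ip_header(kind, decrement, dest, protocol, length):
    """Pack an IP header instruction into a 32-bit word.

    protocol is the IPv4 'protocol' or IPv6 'next header' value (bits 23:16);
    length is the IPv4 'length' or IPv6 'payload length' value (bits 15:00).
    """
    if kind not in OPCODES:
        raise ValueError(f"unknown IP header instruction: {kind!r}")
    for name, value, bits in (
        ("decrement", decrement, 1),  # D, bit 27
        ("dest", dest, 3),            # t.o.dest., bits 26:24
        ("protocol", protocol, 8),
        ("length", length, 16),
    ):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bit(s)")
    return (
        (OPCODES[kind] << 28)
        | (decrement << 27)
        | (dest << 24)
        | (protocol << 16)
        | length
    )


# IPv4: decrement 'time to live', protocol 50 (ESP), total length 100 bytes.
print(f"0x{encode_ip_header('IPv4', 1, 0b001, 50, 100):08X}")  # prints 0x79320064
# IPv6: keep 'hop limit', next header 6 (TCP), payload length 40 bytes.
print(f"0x{encode_ip_header('IPv6', 0, 0b001, 6, 40):08X}")    # prints 0x81060028
```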
[Figure: IPv6 datagram header, 32 bits wide, showing the version, traffic class, flow label, payload length, next header, and hop limit fields. The payload length and next header fields are replaced with the value from the instruction; the hop limit field is optionally decremented.]
Figure D-19: IPv6 Instruction: Datagram Modifications

D.5.6 Post-Process Instructions
Post-processing instructions are used to modify packet data after the packet has been processed by the packet processing module. Typically this includes updating packet headers (for example, inserting the ICV in the AH header) or adding data to or removing data from the packet. Post-processing instructions are executed by the post-processing module.

There are two post-process instructions: the INSERT_REMOVE_RESULT instruction and the REPLACE_BYTE instruction. The REPLACE_BYTE instruction can be used to replace a single byte in the output data stream with a value from the instruction.

Note that both instructions have limited functionality in the Crypto Packet Processor configurations because of the minimal-sized output buffer. In the Crypto Packet Processor configurations these two instructions can only append result data, whereas in the standard Crypto Packet Processor they can also update fields in the packet.

D.5.6.1 INSERT_REMOVE_RESULT (IRR) Instruction
The INSERT_REMOVE_RESULT instruction (also referred to as the IRR instruction) can be used as an INSERT or a REMOVE operation. It can take result data from the hash result registers or the status registers and INSERT (actually replace) this data in the appropriate fields of the output data stream, or append it to the end of the data stream. The same instruction can also be used to REMOVE a hash compare value from the decrypted data stream, starting from any byte.

The IRR instruction functions as a remove_result operation when all of the following fields are set to ‘0’: L, NH, CS and P.
Otherwise, this instruction functions as an insert_result operation for updating the length, next header, and checksum fields in IPv4 or IPv6 headers (refer to “INSERT_REMOVE_RESULT Instruction (remove_result Operation)” on page 411).

The IRR instruction is typically used to update certain fields in the output packet with new values. The common use case is the insertion of the ICV when performing AH outbound operations, or updating the IP header fields when performing ESP inbound (transport mode) operations. However, in the Crypto Packet Processor configurations, the IRR instruction is not able to update header data in the output buffer. This logic is disabled for the Crypto Packet Processor configurations because of the minimum-sized data output buffer. Instead, the input token must use a special mode of this instruction to append the update data to the end of the packet. A module external to the Crypto Packet Processor must inspect the output token, which contains information about the appended data. This module must be aware of the protocol use case and must replace the required fields using information from the appended data. These actions are similar to those for “jumbo” frames in the standard Crypto Packet Processor. Refer to “Inline Packet Engine — Token Examples” on page 491 for the specific AH outbound and ESP inbound transport mode tokens that can be used within the Crypto Packet Processor configurations.

Using the IRR instruction to remove an encrypted block of zeroes from the output buffer (used in AES-GCM/GMAC/CCM modes) is not affected in the Crypto Packet Processor and remains the same.

Ordering and alignment of appended data to the output data stream: When using the IRR instruction to append updated IP header fields to the end of the output data stream, the field order and byte alignment are fixed.
When more than one of the L, NH, and CS fields are to be appended to the data stream, the updated fields are always appended in the following order: the length field if selected, then the next header field if selected, and finally the checksum field if selected. These fields are always appended as 32-bit dwords, with the following byte alignment within the dword for IPv4:

Alignment of Appended IP Header Fields (IPv4), MSB to LSB
1st appended dword (if selected): 16’h0000 | ‘total length’ (IPv4), dword bits [15:0]
2nd appended dword (if selected): 16’h0000 | ‘protocol’ (IPv4), dword bits [15:8] | 8’h00
3rd appended dword (if selected): 16’h0000 | ‘checksum field’ (IPv4 only), dword bits [15:0]

For all other cases (IPv6) the alignment is as follows:

Alignment of Appended IP Header Fields (IPv6), MSB to LSB
1st appended dword (if selected): 16’h0000 | ‘payload length’ (IPv6), dword bits [15:0]
2nd appended dword (if selected): 16’h0000 | 8’h00 | ‘next header’ (IPv6), dword bits [7:0]

INSERT_REMOVE_RESULT Instruction (remove_result Operation)
This section describes the INSERT_REMOVE_RESULT instruction functioning as a remove_result operation. The remove_result operation is specifically used to remove the hash result or an encrypted value (see the description of bit 17 of Context Control Word1 below) from the output data stream, allowing the removed value to be used subsequently by the VERIFY_FIELDS instruction. This differs from the RETRIEVE instruction, which can only extract data from the input data stream.
Note that the postprocessing module can only buffer one instruction at a time; execution of the current postprocessing instruction must be completed before the next instruction can be passed to the postprocessing module. Therefore, the order of these instructions within the token is important, specifically when the remove_result operation is combined with additional postprocessing instructions. The remove_result operation must wait in the postprocessing module until the data to be removed actually passes through this module. The control module continues processing instructions from the token, but if an additional postprocessing instruction is encountered, further instruction processing in the control module is halted until the postprocessing module has finished its current remove_result operation. Only then can the postprocessing module accept the new postprocessing instruction, allowing the control module to continue processing.

When constructing tokens, also be aware of the following. The remove_result operation contains an offset to a location in the packet data stream. If the remove_result operation is provided to the postprocessing module after this offset location in the data stream has already passed through the postprocessing module (for example, because other instructions blocked further instruction processing), then this outdated remove_result operation will never trigger. This prevents the Crypto Packet Processor from completing the operation on the current token, which causes the Crypto Packet Processor to stop processing packets until eventually the timeout counter triggers and a result token with error code E14 (refer to “Result Token Definition” on page 429) is returned. Note that for this error to occur, the timeout counter must be activated.
The remove_result operation is context sensitive in the sense that its behavior changes depending on the context of the packet currently being processed. The context bit in question is the encrypt hash result bit, bit [17] of context control word 1 from the packet context. The reason for this different behavior is explained as follows.

Context Control Word1, bit 17, Encrypt Hash Result = ‘1’
If this bit is set to ‘1’, the remove_result operation removes an encrypted value from the data stream. In this case, the operation is specifically meant for use with the GHASH module (used in the AES-GCM/GMAC algorithms) or the XCBC module (used in the AES-CCM algorithm). Note that the encrypt hash result bit may only be used in combination with one of these modes. Both modes require the result of the hash operation to be XOR-ed with an encrypted value to generate the actual ICV (integrity check value). Therefore, in the case of GHASH (AES-GCM) or XCBC (AES-CCM) with the encrypt_hash_result bit set to ‘1’, the following instructions are used to generate the ICV.

1. A remove_result instruction is used to schedule removal of the first block (16 bytes) of the encrypted input data and to store it in internal registers, while the rest of the data stream is processed and the final hash result is calculated. The instruction is executed after all packet data has passed through the crypto engines.
Note 1: The first encrypted block cannot be used for encrypting data, since doing so would open up the possibility of an attack on the nonce. Instead, this block is used to encrypt the final hash result.
Note 2: The length field must be set to 16 (AES block size) for this mode to operate correctly.
2. The following INSERT instruction is used to pass the first encrypted block to the post processor.
This operation inserts 16 bytes of zeros (AES block size) and performs an XOR with the first encrypted block in the data stream. This results in passing the first encrypted block directly to the Crypto Packet Processor postprocessor, for later encrypting (XOR-ing with) the hash result.
3. Once the final hash result is computed, it is encrypted (XOR-ed with the internally stored encrypted block of zeroes). This XOR operation is enabled by the encrypt hash result bit.
4. The result digest can be used by subsequent VERIFY or INSERT instructions as usual.

Context Control Word1, bit 17, Encrypt Hash Result = ‘0’
In this case the remove_result operation is used to extract the hash result from the inbound (to be decrypted) data stream, after decryption by the packet engine. This is only relevant for protocols that specify the hash result to be encrypted after the plaintext payload data, such as SSL or TLS.
Note: SSL and TLS require hash-then-encrypt for packet processing, which automatically encrypts the hash result, unlike the encrypt-then-hash operation for the IPSec outbound ESP transform.

IRR Instruction (remove_result Operation)
Instruction word, bits 31 to 00 (left to right): opcode = 1010 | L = 0 | NH = 0 | CS = 0 | length | STAT | P = 0 | offset in the output data stream

Table D-1. IRR Instruction remove_result Operation Definition
Bits  Name  Description
31:28  opcode  Refer to "opcode" for a description of this field.
27, 26, 25  L, NH, CS  The REMOVE_RESULT instruction requires the ‘L’, ‘NH’, and ‘CS’ fields to be set to ‘0’.
24:19  length  The length field (see "length" on page 395) indicates the number of bytes to be removed; this can be 12, 16, 20 or 32 bytes. If the length field is 0 (6’b00_0000), 64 bytes are removed.
18:17  STAT  Refer to Table D-2, “Instruction Format,” on page 394 for a description of this field.
16  P  The REMOVE_RESULT instruction requires the ‘P’ field to be set to ‘0’.
15:00  offset in output data stream  Removal starts at the location specified by the offset value in this field.

INSERT_REMOVE_RESULT Instruction (insert_result Operation)
This section describes the INSERT_REMOVE_RESULT (IRR) instruction functioning as an insert_result operation. In the Crypto Packet Processor configurations this instruction can only be used to append data to the packet.

The insert_result operation is typically used to update:
1. The length, next header, and checksum fields in IPv4 or IPv6 headers in the output data stream, when processing an inbound transport mode ESP packet
2. The hash result field in AH headers in the output data stream, when processing an outbound mode AH packet

For inbound tunnel mode packets, header updates are not required because the full header is already present inside the ESP payload (see the list of RFCs in “References” on page 577 for details on AH/ESP tunnel/transport modes).

Note: The name of this instruction can be somewhat misleading, because the instruction does not add data to the output data stream but replaces existing data. It should not be confused with the REPLACE instruction, which operates on packet data in the input data stream before the data is submitted to the packet processing module, or with the INSERT instruction, which actually does add data to the input data stream. The insert_result operation operates strictly on the output data stream, after packet processing.

IRR Instruction (insert_result Operation) General Format
Instruction word, bits 31 to 00 (left to right): opcode = 1010 | L | NH | CS | length | STAT | P = 1 | offset in output stream

Table D-1. IRR Instruction General Format Definition
Bits  Name  Description
31:28  opcode  Refer to "opcode" for a description of this field.
27  L  If set to ‘1’, indicates that the packet header ‘length’ field needs to be updated to match the actual, processed packet.
26  NH  If set to ‘1’, indicates that the IPv4/IPv6 packet header ‘Protocol’/‘Next Header’ field needs to be updated to match the actual, processed packet.
25  CS  If set to ‘1’, indicates that the packet header Checksum field needs to be updated to match the actual, processed packet.
24:19  length  Indicates the hash result length, or the relative location of the next header field, or the relative location of the checksum field if no NH is available. A length of 0 (6’b00_0000) is not allowed when indicating a relative location for next header or checksum.
18:17  STAT  See Table D-2, “Instruction Format,” on page 394 for a description of this field.
16  P  The insert_result operation requires this field to be set to ‘1’.
15:00  offset in output stream  Indicates the location of the hash result in the packet or the location of the length field (byte pointer). If bits [15:1] are all set to ‘1’, the data is always appended to the end of the packet rather than inserted (see Note 2).

Note 1: The insert_result operation should only be used for updating data in (as opposed to after) the output data stream. If data must be appended because of output buffer overflow, or if instructed by the ToO field in the token command word, the data is appended aligned to a 32-bit boundary. The data is appended after the packet, and the applicable field in the result token is set, even if the packet data did not end on a 32-bit boundary. Appending (as opposed to updating) hash result data after the data stream (possibly misaligned), according to the applicable protocol (for example, IPSec ESP), should be done using an INSERT or REPLACE instruction with the t.o.dest. field set to output only.
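To make the distinction between the two IRR operations concrete, the following Python sketch unpacks an IRR instruction word using the bit ranges from the general format table and classifies it using the rule stated in this section (remove_result when L, NH, CS, and P are all ‘0’, insert_result otherwise). The function name decode_irr and the returned dictionary keys are hypothetical; the example words are constructed for illustration and are not taken from a real token.

```python
# Hypothetical decoder; names are illustrative, not part of any Tilera API.
IRR_OPCODE = 0b1010  # bits 31:28


def decode_irr(word):
    """Split a 32-bit IRR instruction word into its fields and classify it."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("instruction word must fit in 32 bits")
    if (word >> 28) != IRR_OPCODE:
        raise ValueError("not an INSERT_REMOVE_RESULT instruction")
    fields = {
        "L": (word >> 27) & 0x1,
        "NH": (word >> 26) & 0x1,
        "CS": (word >> 25) & 0x1,
        "length": (word >> 19) & 0x3F,  # bits 24:19
        "STAT": (word >> 17) & 0x3,     # bits 18:17
        "P": (word >> 16) & 0x1,
        "offset": word & 0xFFFF,        # bits 15:00
    }
    # remove_result: L, NH, CS and P are all '0'; otherwise insert_result.
    is_remove = not (fields["L"] or fields["NH"] or fields["CS"] or fields["P"])
    fields["operation"] = "remove_result" if is_remove else "insert_result"
    # For insert_result, offset bits [15:1] all '1' means append to the packet.
    fields["append"] = (not is_remove) and (fields["offset"] >> 1) == 0x7FFF
    return fields


print(decode_irr(0xA0800020)["operation"])  # prints remove_result
print(decode_irr(0xA063FFFF)["operation"])  # prints insert_result
```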
Note 2: These instructions, with the exception of the remove_result operation, must always be located after instructions of Types 1 and 2.

Examples of the INSERT_REMOVE_RESULT Instruction (insert_result Operations)
This section provides examples of the INSERT_REMOVE_RESULT (IRR) instruction used to perform the following specific insert_result operations.
1. Insert hash result.
2. Insert length, next header, and checksum.
3. Insert modified length and next header with checksum modification.
4. Insert modified length and next header without checksum modification.
5. Insert next header and checksum.
6. Insert modified length and checksum.
7. Insert modified length with or without checksum modification.
8. Insert next header with or without checksum modification.
9. Insert modified length.
10. Insert checksum.

Where:
length = IPv4 total packet length
modified length = length value modified to equal IPv6 payload length
checksum = update IPv4 packet header checksum field
checksum modification = update internal checksum register only

IRR Instruction Example (Insert Hash Result Operation)
This operation example inserts the hash result (if available) at the location indicated by the offset in output stream field. The presence of this instruction in the token requires a previous instruction in which the hash done bit of the STAT field, STAT[0] (bit [17] of a preprocessing instruction), is set. Otherwise, an error is generated because the hash operation was not completed. Note that if no hash operation was specified for the current packet, this operation inserts zeros in the data stream. This insert_result operation example is typically used to insert an AH header ICV field in the case of outbound mode.
IRR Instruction Example (Insert Hash Result Operation)
Instruction word, bits 31 to 00 (left to right): opcode = 1010 | L = 0 | NH = 0 | CS = 0 | length | STAT (STAT[0] = 1) | P = 1 | offset in output stream

Table D-1. IRR Instruction Example Definition
Bits  Name  Description
31:28  opcode  Refer to "opcode" for a description of this field.
27, 26, 25  L, NH, CS  This operation requires the ‘L’, ‘NH’, and ‘CS’ fields to be set to ‘0’.
24:19  length  The length field (see "length" on page 395) indicates the length of the inserted hash result. Note: When updating (inserting or appending) a hash result, a length of 0 (6’b00_0000) results in an update of 64 bytes.
18:17  STAT[0]  This bit must be set to ‘1’. Because this operation does not modify the packet header, it does not update the internally calculated header checksum. Therefore, STAT[0] must be set to ‘1’, indicating that no checksum update is required as a result of this operation.
16  P  This operation requires this field to be set to ‘1’.
15:00  offset in output stream  The location in the output data stream at which to insert the hash result. If all bits in this field are set to ‘1’, the hash result is appended to the packet. This appending is 32-bit aligned, regardless of the byte alignment at the end of the packet.

Note: Figure D-20 applies to packets that are less than 1792 bytes in length. For larger packets, the 12-byte ICV field is appended to the packet by default, in which case Figure D-21 applies.
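The constraints in the table above (L, NH, and CS cleared, STAT[0] and P set, a length of 0 meaning 64 bytes, and an all-ones offset meaning append) can be sketched as a small encoder. This is a minimal Python sketch, not a Tilera interface: the names encode_irr_insert_hash and APPEND_OFFSET are hypothetical, and leaving STAT[1] at ‘0’ is an assumption, since its meaning is defined in the “Instruction Format” table rather than here.

```python
# Hypothetical encoder; names are illustrative, not part of any Tilera API.
IRR_OPCODE = 0b1010      # bits 31:28
APPEND_OFFSET = 0xFFFF   # all offset bits set: append the hash result to the packet


def encode_irr_insert_hash(hash_len, offset):
    """Build an IRR word for the insert hash result operation.

    hash_len is the hash result length in bytes (1..64);
    offset is the byte location in the output data stream, or APPEND_OFFSET.
    """
    if not 1 <= hash_len <= 64:
        raise ValueError("hash_len must be between 1 and 64 bytes")
    if not 0 <= offset <= 0xFFFF:
        raise ValueError("offset must fit in 16 bits")
    length_field = 0 if hash_len == 64 else hash_len  # 6'b00_0000 encodes 64 bytes
    stat = 0b01  # STAT[0] = 1: no checksum update; STAT[1] = 0 is an assumption
    p = 1        # required for insert_result operations
    # L, NH and CS (bits 27:25) stay '0' for this operation.
    return (
        (IRR_OPCODE << 28)
        | (length_field << 19)
        | (stat << 17)
        | (p << 16)
        | offset
    )


# Append a 12-byte ICV to the end of the packet.
print(f"0x{encode_irr_insert_hash(12, APPEND_OFFSET):08X}")  # prints 0xA063FFFF
# Insert a 64-byte hash result at byte offset 24.
print(f"0x{encode_irr_insert_hash(64, 24):08X}")             # prints 0xA0030018
```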
[Figure: packet layout consisting of IP Header, AH Header, and Packet Data, with byte alignment within 32-bit words shown from bit 31 to bit 0 (B3 B2 B1 B0, B7 B6 B5 B4, B11 B10 B9 B8). The hash result (ICV) is calculated by the Crypto Packet Processor and inserted in the AH header ICV field, starting with byte 0.]
Figure D-20: IPv4 Inserted Field - insert_hash_result Operation - AH Header ICV Field

[Figure: packet layout consisting of IPv4 Header, AH Header, and Packet Data, followed by the appended data (ICV), with byte alignment within 32-bit words shown from bit 31 to bit 0 (B3 B2 B1 B0, B7 B6 B5 B4, B11 B10 B9 B8). The hash result (ICV) is calculated by the Crypto Packet Processor and appended, starting with byte 0.]
Figure D-21: IPv4 Appended Field - insert_hash_result Operation - AH Header ICV Field

IRR Instruction Examples (Modifying IP Header Using insert_result Operations)
The insert_result operation can also be used to modify the IP header fields selected by L, NH, and CS. The following subsections present IRR instruction examples that modify the IP header using insert_result operations. Because these example operations always modify the IP header, updating the internal checksum is always required, and is enabled by setting the STAT[0] bit to ‘0’ in the instruction.

Insert_Result Operation Example (Insert Length, Next Header and Checksum)
This insert_result operation example is typically used for normal IPv4 header updates in the case of inbound ESP transport mode. In this example, each of the L, NH, and CS fields is enabled (set to ‘1’). The operation assumes that the next header and checksum fields are adjacent in the IPv4 header. See Figure D-22. The checksum field immediately follows the protocol field, which is updated with the next header value. This instruction inserts the actual packet length (calculated by the Crypto Packet Processor after processing) at the location indicated by the offset in output stream field.
The length field indicates the positive offset, relative to the offset in output stream field, at which the next header value is to be inserted (the IPv4 protocol field); the checksum is inserted immediately next to this location. The next header value used for the insertion is retrieved from the padding. See Figure D-24. The inserted checksum is the result of adding the length and next header values to the current checksum value (calculated so far by the Crypto Packet Processor), where next header is added to the 8 MSbs of the checksum. In the IPv4 header, the location of the checksum field (immediately adjacent to the protocol field) and the way the protocol field must be added to the checksum are fixed and correspond to the order of these fields.

In IPv6, this operation updates only the payload length and next header fields. The payload length is inserted at the location pointed to by the offset in output stream field. The length field must reflect the actual size of the IPv6 packet header for this packet. The contents of the length field are subtracted from the packet length in bytes (calculated internally by the Crypto Packet Processor) to obtain the actual payload length to be inserted in the header. See Figure D-27.

In IPv4 and IPv6, if all bits in the offset in output stream field are set to ‘1’, the applicable updated fields are appended to the packet. This appending is 32-bit aligned, regardless of the byte alignment at the end of the packet.

IRR Instruction Example (Insert Length, Next Header, and Checksum)
In this example the inserted length is the total length. This is the result packet length field in the result token. This must equal the required value in the IPv4 total length field; otherwise the wrong length will be inserted.
Therefore, when inserting data in front of the IP header, make sure that appropriate postprocessing of the result packet is done to correct the length field (and checksum if needed) in the IP header.

Bit layout (‘-’ = don’t care):
Bits    31:28   27  26  25  24:19   18:17  16  15:00
Field   opcode  L   NH  CS  length  STAT   P   offset in output stream
Value   1010    1   1   1   ------  -0     1   ----------------

Table D-2. IRR Instruction Example Definition
Bits    Name                     Description
31:28   opcode                   Refer to "opcode" for a description of this field.
27      L                        This operation requires the ‘L’, ‘NH’, and ‘CS’ fields to be set to ‘1’.
26      NH
25      CS
24:19   length                   The length field (refer to "length" on page 395) indicates the positive offset, relative to the ‘offset in output stream’ field, at which the next header value is to be inserted; the checksum is inserted immediately after this location.
18:17   STAT[0]                  This bit must be set to ‘0’, indicating that a checksum update is required as a result of this operation.
16      P                        This operation requires this field to be set to ‘1’.
15:00   offset in output stream  The location in the output data stream where the actual packet length calculated by the Crypto Packet Processor is inserted after packet processing. If all bits in this field are set to 1, the appropriate fields are appended to the packet.

See Figure D-22 (IPv4) and Figure D-23 (IPv6).
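The field positions in Table D-2 can be illustrated with a small packing routine. This is a minimal sketch, not Tilera software: the function name encode_insert_result is hypothetical, the don't-care bit STAT[1] is assumed to be written as 0, and the example values (length 6'h07, offset 16'h0002) are the ones used with Figure D-24.

```python
OPCODE_INSERT_RESULT = 0b1010  # bits 31:28, as shown in the bit layout


def encode_insert_result(l, nh, cs, length, stat, p, offset):
    """Pack insert_result fields into one 32-bit instruction word.

    Hypothetical helper for illustration; field positions follow Table D-2.
    """
    if not 0 <= length <= 0x3F:
        raise ValueError("length is a 6-bit field (bits 24:19)")
    if not 0 <= stat <= 0x3:
        raise ValueError("STAT is a 2-bit field (bits 18:17)")
    if not 0 <= offset <= 0xFFFF:
        raise ValueError("offset in output stream is a 16-bit field")
    return (
        (OPCODE_INSERT_RESULT << 28)
        | (int(bool(l)) << 27)
        | (int(bool(nh)) << 26)
        | (int(bool(cs)) << 25)
        | (length << 19)
        | (stat << 17)   # STAT[0] is bit 17; 0 requests a checksum update
        | (int(bool(p)) << 16)
        | offset
    )


# IPv4 example: total length at byte offset 2, protocol field 7 bytes later.
word = encode_insert_result(l=1, nh=1, cs=1, length=0x07, stat=0b00, p=1, offset=0x0002)
print(f"0x{word:08X}")  # prints 0xAE390002
```

Range checks are included because the length field is only 6 bits wide, so an out-of-range value would otherwise silently corrupt the neighboring CS and STAT bits.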
[Figure D-22: IPv4 Header Datagram. Fields: version, IHL, type of service, total packet length, identifier, flags, fragment offset, time to live, protocol, header checksum, source address, destination address, options, payload.]

IPv4 header byte layout within 32-bit words:
Word  Bits 31:24             Bits 23:16               Bits 15:8              Bits 7:0
0     Length byte 1 (LSB)    Length byte 0 (MSB)      type of service        version, IHL
1     frag offset part 1     flags, frag off. part 0  identifier byte 1      identifier byte 0
2     checksum byte 1        checksum byte 0          protocol               time to live
3     source address byte 3  source address byte 2    source address byte 1  source address byte 0
4     dest. address byte 3   dest. address byte 2     dest. address byte 1   dest. address byte 0

[Figure D-23: IPv6 Header Datagram. Fields: version, traffic class, flow label, payload length, next header, hop limit, source address, destination address, payload.]

IPv6 header byte layout within 32-bit words:
Word  Bits 31:24              Bits 23:16              Bits 15:8                                     Bits 7:0
0     flow label nibbles 3,4  flow label nibbles 1,2  traffic class nibble 1, flow label nibble 0   version, traffic class nibble 0
1     hop limit               next header             payload length byte 1 (LSB)                   payload length byte 0 (MSB)
2     source address byte 3   source address byte 2   source address byte 1                         source address byte 0
3     source address byte 7   source address byte 6   source address byte 5                         source address byte 4
4     source address byte 11  source address byte 10  source address byte 9                         source address byte 8
5     source address byte 15  source address byte 14  source address byte 13                        source address byte 12
6     dest. address byte 3    dest. address byte 2    dest. address byte 1                          dest. address byte 0
7     dest. address byte 7    dest. address byte 6    dest. address byte 5                          dest. address byte 4
8     dest. address byte 11   dest. address byte 10   dest. address byte 9                          dest. address byte 8
9     dest. address byte 15   dest. address byte 14   dest. address byte 13                         dest. address byte 12

[Figure D-24: IPv4 Updated Fields - insert_result Operation - ‘L’, ‘NH’, ‘CS’ and ‘P’ Fields Selected. The total length is written at the position given by the instruction ‘offset in output stream’ field (16’h0002). The ‘protocol’ field, located at the instruction ‘length’ field value (6’h07) from that position, is updated with the instruction ‘next header’ value, and the checksum follows it.]

Note: Figure D-24 applies to packets that are less than 1792 bytes in length. For larger packets, updated fields are appended to the packet by default, in which case Figure D-25 applies.

[Figure D-25: IPv4 Appended Updates - insert_result Operation - ‘L’, ‘NH’ and ‘CS’ Fields Selected. Three dwords are appended after the IPv4 header and packet data: the total length (the actual packet length calculated by the Crypto Packet Processor) with the upper 16 bits zero, the protocol value P, and the updated checksum value calculated by the Crypto Packet Processor with the upper 16 bits zero.]

Note: Figure D-25 applies to packets that are equal to or larger than 1792 bytes in length. Figure D-25 also applies to packets processed with an insert_result operation with the instruction ‘offset in output stream’ field = 16’hFFFE. Appended updated data always starts on a 32-bit aligned position. See Table D-4.
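The append rule in the notes above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the append value 0xFFFE is taken from Figure D-27 and the REPLACE_BYTE description and is kept as a parameter, and the packet is assumed to start at a 32-bit aligned position.

```python
APPEND_THRESHOLD = 1792  # bytes; from the notes to Figures D-24 and D-25


def fields_are_appended(packet_len, offset_field, append_value=0xFFFE):
    """Return True if updated fields are appended rather than written in place."""
    return packet_len >= APPEND_THRESHOLD or offset_field == append_value


def aligned_append_start(packet_len):
    """Round the end of the packet up to the next 32-bit (4-byte) boundary."""
    return (packet_len + 3) & ~3


print(fields_are_appended(1500, 0x0002))  # False: updated in place
print(fields_are_appended(1792, 0x0002))  # True: at or above the threshold
print(aligned_append_start(1793))         # 1796
```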
Appended dwords (IPv4):
Dword  Bits 31:24  Bits 23:16  Bits 15:8            Bits 7:0
1      reserved    reserved    Length byte 1 (LSB)  Length byte 0 (MSB)
2      reserved    reserved    protocol             reserved
3      reserved    reserved    checksum byte 1      checksum byte 0

[Figure D-26: IPv6 Updated Fields - insert_result Operation - ‘L’, ‘NH’, ‘P’ Fields Selected and ‘length’ = 0x28. The payload length is written at the position given by the instruction ‘offset in output stream’ field (16’h0004), followed by the next header (NH) value.]

Note: Figure D-26 applies to packets that are less than 1792 bytes in length. For larger packets, updated fields are appended to the packet by default, in which case Figure D-27 applies.

[Figure D-27: IPv6 Appended Updates - insert_result Operation - ‘L’, ‘NH’, ‘P’ Fields Selected and ‘length’ = 0x28. With the ‘offset in output stream’ field = 16’hFFFE, the payload length and NH values are appended to the packet on a 32-bit boundary, after the IPv6 header and packet data.]

Appended dwords (IPv6):
Dword  Bits 31:24  Bits 23:16  Bits 15:8                    Bits 7:0
1      reserved    reserved    payload length byte 1 (LSB)  payload length byte 0 (MSB)
2      reserved    reserved    reserved                     next header

The following insert_result operation examples are similar to this example, with one or more options disabled. The diagrams above are applicable to the following sections.

If multiple insert_result operations are used and data is appended to a packet, the appended data is in the order of the operations in the token.

Insert_Result Operation Example (Insert Modified Length and Next Header with Checksum Modification)

There is currently no actual “use case” for this operation.
This operation updates the payload length field of an IP packet header with the length of the result packet. The payload length is inserted at the location pointed to by the offset in output stream field. This operation also updates the next header field of an IP packet header. The next header value is retrieved from the padding and is inserted at the length number of bytes from the payload length field. This operation updates the internal checksum value but does not insert the checksum immediately after the protocol field, as would be appropriate for an IPv4 header. To insert the checksum later, a separate insert_result operation must be used. Refer to Insert_Result Operation Example (Insert Checksum).

Example: To update the length and next header fields in the IPv4 header located in front of the packet, the offset in output stream field value must be 16’h0002 and the length field value must be 6’h07, as shown in Figure D-24.

Insert Modified Length and Next Header with Checksum Modification (‘-’ = don’t care):
Bits    31:28   27  26  25  24:19   18:17  16  15:00
Field   opcode  L   NH  CS  length  STAT   P   offset in output stream
Value   1010    1   1   0   ------  -0     1   ----------------

Insert_Result Operation Example (Insert Modified Length and Next Header w/o Checksum Modification)

This insert_result operation example is typically used to update an IPv6 header in the case of inbound ESP transport mode. This operation updates the payload length and next header fields of an IPv6 packet header. The payload length is inserted at the location pointed to by the offset in output stream field. The length field must reflect the actual size of the IPv6 packet header for this packet. The contents of the length field are subtracted from the packet length (in bytes) calculated internally by the Crypto Packet Processor to obtain the actual payload length value to be inserted in the header.
The next header value is retrieved from the padding and is inserted immediately after the inserted payload length. The internally calculated checksum is not updated, as indicated by the STAT[0] bit being set to 1.

Insert Modified Length and Next Header w/o Checksum Modification (‘-’ = don’t care):
Bits    31:28   27  26  25  24:19   18:17  16  15:00
Field   opcode  L   NH  CS  length  STAT   P   offset in output stream
Value   1010    1   1   0   ------  -1     1   ----------------

Insert_Result Operation Example (Insert Next Header and Checksum)

There is currently no actual “use case” for this operation. This operation inserts the next header and checksum fields. The next header is inserted at the location pointed to by the offset in output stream field. The length field of the instruction indicates the positive offset to the checksum in the data stream. The next header value is retrieved from the padding. This operation updates the internally calculated checksum before inserting the checksum in the data stream.

Insert Next Header and Checksum (‘-’ = don’t care):
Bits    31:28   27  26  25  24:19   18:17  16  15:00
Field   opcode  L   NH  CS  length  STAT   P   offset in output stream
Value   1010    0   1   1   ------  -0     1   ----------------

Insert_Result Operation Example (Insert Modified Length and Checksum)

There is currently no actual “use case” for this operation. This operation updates the payload length of an IPv6 packet header. The payload length is inserted at the location pointed to by the offset in output stream field. The length field must reflect the actual size of the IPv6 packet header for this packet.
The contents of the length field are subtracted from the packet length (in bytes) calculated internally by the Crypto Packet Processor to obtain the actual payload length value to be inserted in the header. The inserted checksum is the sum of the current checksum value and the inserted modified length.

Insert Modified Length and Checksum (‘-’ = don’t care):
Bits    31:28   27  26  25  24:19   18:17  16  15:00
Field   opcode  L   NH  CS  length  STAT   P   offset in output stream
Value   1010    1   0   1   ------  -0     1   ----------------

Insert_Result Operation Example (Insert Modified Length w/ or w/o Checksum Modification)

This operation updates the payload length of an IPv6 packet header. The payload length is inserted at the location pointed to by the offset in output stream field. The length field must reflect the actual size of the IPv6 packet header for this packet. The contents of the length field are subtracted from the packet length (in bytes) calculated internally by the Crypto Packet Processor to obtain the actual payload length value to be inserted in the header. If P is set to 0, the modification is added to the postprocessing checksum; otherwise (P=1) the modification is added to the preprocessing checksum and the postprocessing checksum is overwritten with the result. The STAT[0] field indicates whether the internally calculated checksum is updated. If STAT[0] = ‘0’, the internal checksum register is updated. If STAT[0] = ‘1’, the internal checksum register is not updated.
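The payload length rule above is a simple subtraction, sketched here for illustration. The function names are hypothetical; the header size 0x28 (40 bytes) is the ‘length’ value used with Figures D-26 and D-27, and the byte order follows the header layout, where payload length byte 0 is the MSB.

```python
IPV6_HEADER_LEN = 0x28  # 40 bytes, the 'length' value in Figures D-26 and D-27


def ipv6_payload_length(packet_len, length_field=IPV6_HEADER_LEN):
    """Payload length = internally calculated packet length - length field."""
    payload = packet_len - length_field
    if payload < 0:
        raise ValueError("packet is shorter than the stated IPv6 header size")
    return payload


def payload_length_bytes(payload):
    """Return (byte 0 = MSB, byte 1 = LSB) as they appear in the header."""
    return payload.to_bytes(2, "big")


payload = ipv6_payload_length(1000)
print(payload)                              # 960
print(payload_length_bytes(payload).hex())  # 03c0
```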
Insert Modified Length w/ or w/o Checksum Modification (‘-’ = don’t care):
Bits    31:28   27  26  25  24:19   18:17   16  15:00
Field   opcode  L   NH  CS  length  STAT    P   offset in output stream
Value   1010    1   0   0   ------  -, 0/1  1   ----------------

Insert_Result Operation Example (Insert Next Header w/ or w/o Checksum Modification)

This operation inserts a next header value at the location pointed to by the offset in output stream field. The length field of this instruction can be 0. If P is set to 0, the modification is added to the postprocessing checksum; otherwise (P=1) the modification is added to the preprocessing checksum and the postprocessing checksum is overwritten with the result. The STAT[0] field indicates whether the internally calculated checksum is updated. If STAT[0] = ‘0’, the internal checksum register is updated. If STAT[0] = ‘1’, the internal checksum register is not updated.

Insert Next Header w/ or w/o Checksum Modification (‘-’ = don’t care):
Bits    31:28   27  26  25  24:19   18:17   16  15:00
Field   opcode  L   NH  CS  length  STAT    P   offset in output stream
Value   1010    0   1   0   ------  -, 0/1  1   ----------------

Insert_Result Operation Example (Insert Checksum)

This operation inserts the checksum field. The checksum is inserted at the location indicated by the offset in output stream field. The inserted checksum is the sum of the current checksum and the fields inserted before this instruction.

Insert Checksum (‘-’ = don’t care):
Bits    31:28   27  26  25  24:19   18:17  16  15:00
Field   opcode  L   NH  CS  length  STAT   P   offset in output stream
Value   1010    0   0   1   ------  -0     1   ----------------

D.5.6.2 REPLACE_BYTE Instruction

The REPLACE_BYTE instruction overwrites one byte of data in the output data stream.
The byte used to overwrite is located in the instruction’s data byte field. The byte to be overwritten is located by the offset in output stream field. The REPLACE_BYTE instruction can also be used to append one byte of data to the output data stream by setting the offset in output stream field to 0xFFFE. In this case, a full dword is appended to the output data stream, with the instruction’s data byte field located in bits [7:0] of the dword and dword bits [31:8] set to 0. Note also that the B bit in the result token is set, indicating that the data is appended.

Note that in the Crypto Packet Processor configurations only the last option of the REPLACE_BYTE instruction is available: because of the output buffer’s limited size, the REPLACE_BYTE instruction can only be used to append a single byte.

REPLACE_BYTE (‘-’ = don’t care):
Bits    31:28   27:20      19  18:17  16  15:00
Field   opcode  data byte  R   STAT   R   offset in output stream
Value   1011    --------   0   --     0   ----------------

Table D-3. REPLACE_BYTE Definition
Bits    Name                     Description
31:28   opcode                   Refer to "opcode" for a description of this field.
27:20   data byte                The byte of data used to overwrite a byte in the output data stream, or appended to the data stream.
19      R                        Reserved. Set to 0.
18:17   STAT                     Refer to Table D-2, “Instruction Format,” on page 394 for a description of this field.
16      R                        Reserved. Set to 0.
15:00   offset in output stream  The offset to the data byte in the output data stream to be overwritten. Refer to "opcode" for a description of this field.

D.5.6.3 Reserved Instructions

The opcodes for reserved instructions may not be used.
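For the REPLACE_BYTE layout in Table D-3, the encoding and the appended dword can be sketched as below. This is an illustration under stated assumptions, not vendor code: the function names are hypothetical, STAT is passed through unchanged because its meaning is defined elsewhere in this appendix, and both reserved bits are written as 0.

```python
OPCODE_REPLACE_BYTE = 0b1011  # bits 31:28, as shown in the bit layout
APPEND_OFFSET = 0xFFFE        # offset value that requests an append


def encode_replace_byte(data_byte, offset, stat=0):
    """Pack a REPLACE_BYTE instruction word (reserved bits 19 and 16 = 0)."""
    if not 0 <= data_byte <= 0xFF:
        raise ValueError("data byte is an 8-bit field (bits 27:20)")
    if not 0 <= stat <= 0x3:
        raise ValueError("STAT is a 2-bit field (bits 18:17)")
    if not 0 <= offset <= 0xFFFF:
        raise ValueError("offset in output stream is a 16-bit field")
    return (OPCODE_REPLACE_BYTE << 28) | (data_byte << 20) | (stat << 17) | offset


def appended_dword(data_byte):
    """Dword appended to the output: data byte in bits 7:0, bits 31:8 zero."""
    return data_byte & 0xFF


word = encode_replace_byte(0x5A, APPEND_OFFSET)
print(f"0x{word:08X}")                  # prints 0xB5A0FFFE
print(f"0x{appended_dword(0x5A):08X}")  # prints 0x0000005A
```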
RESERVED (‘-’ = don’t care):
Bits    31:28   27:00
Field   opcode  (unused)
Value   1100    ----------------------------

D.5.7 Result Instructions

The VERIFY_FIELDS instruction is currently the only RESULT instruction.

D.5.7.1 VERIFY_FIELDS Instruction

This instruction verifies context record fields or other retrieved values against other retrieved or calculated values. These values can be the hash result, padding, checksum, SPI, or sequence number. Comparison of retrieved, calculated, and/or context values can only be performed for inbound packets.

The VERIFY_FIELDS instruction must always be the last instruction sent after the execution instructions. It cannot be followed by additional execution instructions; it can only be followed by a Context Control instruction.

This instruction is used for generating error codes E9 through E13. See Table D-8, “Error Codes,” on page 430 for error code descriptions. If no VERIFY_FIELDS instruction is used, these error conditions are not detected.

VERIFY_FIELDS (‘-’ = don’t care):
Bits    31:28   27  26  25  24  23:19     18:17  16  15:00
Field   opcode  S   SP  CS  P   reserved  STAT   H   length
Value   1101    -   -   -   -   00000     11     -   (see Table D-4)

Table D-4. VERIFY_FIELDS Definition
Bits    Name      Description
31:28   opcode    Refer to "opcode" for a description of this field.
27      S         Sequence Number. Set this bit to ‘1’ to verify the sequence number. Refer to “Sequence Number Check” on page 450.
26      SP        SPI. Set this bit to ‘1’ to verify the retrieved SPI.
25      CS        Checksum. Set this bit to ‘1’ to verify the checksum.
24      P         Padding. Set this bit to ‘1’ to verify padding. (Set to ‘1’ only if the padding type allows verification.)
23:19   reserved  Reserved. These bits should be set to ‘0’.
18:17   STAT      Status. This field must always be set to ‘11’.
16      H         Hash. Set this bit to ‘1’ to verify the hash result.
15:00   length    Length. Indicates the number of hash result bytes that need to be compared. Valid values are:
                  0001100  12 bytes
                  0010000  16 bytes
                  0010100  20 bytes (SHA-1 and SHA-2 only)
                  0011000  24 bytes (SHA-2 only)
                  0011100  28 bytes (SHA-2 only)
                  0100000  32 bytes (SHA-2 only)
                  0110000  48 bytes (SHA-2 only)
                  1000000  64 bytes (SHA-2 only)

D.5.8 Context Control Instructions

The CONTEXT_ACCESS instruction is currently the only Context Control instruction supported.

D.5.8.1 CONTEXT_ACCESS Instruction

The CONTEXT_ACCESS instruction is used to update fields in the external context record. The offset field is an external relative offset pointing to the base address of the context record. The origin field is a second offset pointing to the internal field(s) that need to be updated. In IPSec, this instruction is typically used to update the sequence number of the context record and, for an inbound packet, the sequence number mask fields. Optionally, the IV fields can also be updated using a separate CONTEXT_ACCESS instruction.

The CONTEXT_ACCESS instruction is only executed after all Crypto Packet Processor processing completes (preprocessing, packet engine processing, and postprocessing), unless both the Fail and Pass (F and P) fields are set to 0, in which case the CONTEXT_ACCESS instruction can be executed at any time.

Using the F and P fields (together referred to as the result type field), the CONTEXT_ACCESS instruction can be selectively executed depending on whether a packet passes or fails. If the result type field is set to 01, the instruction is executed if the packet passed. If the result type field is set to 10, the instruction is executed if the packet failed.
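The result type gating described above, together with the 11 combination listed in Table D-5, can be sketched as a small decision function. The function and parameter names are hypothetical and only model the described behavior; they do not correspond to a hardware or driver interface.

```python
def context_access_runs(f, p, processing_done, passed):
    """Decide whether a CONTEXT_ACCESS instruction executes.

    f, p: the Fail and Pass bits (result type = F,P).
    processing_done: all Crypto Packet Processor processing has completed.
    passed: processing completed without errors.
    """
    result_type = (f << 1) | p
    if result_type == 0b00:
        return True              # may execute at any time
    if not processing_done:
        return False             # other result types wait for completion
    if result_type == 0b01:
        return passed            # only if the packet passed
    if result_type == 0b10:
        return not passed        # only if the packet failed
    return True                  # 0b11: passed or failed


print(context_access_runs(0, 1, True, True))   # True
print(context_access_runs(1, 0, True, True))   # False
```

A typical use would pair a 01 instruction that writes back the sequence number with a 10 instruction that records statistics for failed packets.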
CONTEXT_ACCESS (‘-’ = don’t care; F and P together form the result type field):
Bits    31:28   27:24   23:19   18:17  16:15  14  13  12  11  10:8      7:0
Field   opcode  length  origin  STAT   res    U   F   P   D   reserved  offset
Value   1110    ----    -----   --     00     -   -   -   -   000       --------
The instruction word may be followed by optional data (maximum 1 dword).

Table D-5. CONTEXT_ACCESS
Bits    Name      Description
31:28   opcode    Refer to "opcode" for a description of this field.
27:24   length    The length field (refer to "length" on page 395) indicates the number of dwords that need to be transferred.
23:19   origin    The ‘origin’ field is an offset pointing to the internal context record field(s) that need to be updated. All internal registers can be read for insertion into the external context record. Note that only the general-purpose, IV, and sequence number (and mask) internal registers are writable using this instruction. Typically, only read actions are required to update the external context record. Note that in Table D-6, some of the internal registers are ordered to allow multiple fields to be inserted with one instruction. For example, the registers highlighted in yellow (sequence number result and sequence number mask, or IV0 through IV3) are typically written to the external context record with one instruction.
18:17   STAT      11 = last instruction (optionally set).
16:15   res       Reserved, must be set to 00.
14      U         Use token data when processing this instruction. There can only be one dword of data immediately following the token instruction. In this case the ‘origin’ field must be set to 11011 and the length field must be 0001.
13      F         Fail and Pass bit combinations. Together, these bits are also referred to as the ‘result type’ field.
12      P         00  Execute the CONTEXT_ACCESS instruction immediately. When both bits are set to ‘00’, this instruction can be executed at any time, even before packet processing.
                  01  Execute the CONTEXT_ACCESS instruction only if all Crypto Packet Processor processing has completed successfully (without errors).
                  10  Execute the CONTEXT_ACCESS instruction only if all Crypto Packet Processor processing has completed unsuccessfully. In this case the VERIFY_FIELDS instruction resulted in one or more of error codes E9 through E13; see Table D-8 for error code descriptions. Executing the CONTEXT_ACCESS instruction in these cases could be useful for debugging or for keeping statistical data.
                  11  Execute the CONTEXT_ACCESS instruction only if all Crypto Packet Processor processing has completed, successfully or unsuccessfully.
11      D         Direction. In the Crypto Packet Processor this bit must be set to 1. This indicates that the CONTEXT_ACCESS instruction results in the Crypto Packet Processor updating the external context record, also known as a ‘context write’ operation to the external context record.
                  Note: In standard (protocol) scenarios the ‘context read’ operation (when the D bit is set to 0) is not used. In the Crypto Packet Processor configurations only ‘context write’ operations are allowed. ‘Context read’ operations should not be started; the only reads are sequential fetches from the context input FIFO that are initiated by the Crypto Packet Processor itself.
10:8    reserved  Reserved bits must be set to 0.
7:00    offset    The (32-bit word) offset in the context record.

Table D-6. ‘Origin’ Field Encoding for CONTEXT_ACCESS Instruction
‘origin’ field value  Context Record Fields
00000                 reserved
00001                 reserved
00010                 reserved
00011                 reserved
00101                 sequence number
00110                 sequence number mask (length can be 2 or 4)
00111                 reserved (sequence number mask 2nd word)
01000                 reserved (sequence number mask 3rd word)
01001                 reserved (sequence number mask 4th word)
01010                 sequence number
01011                 extended sequence number
01100                 sequence number mask (length can be 2 or 4)
01101                 reserved (sequence number mask 2nd word)
01110                 reserved (sequence number mask 3rd word)
01111                 reserved (sequence number mask 4th word)
10000                 general purpose register 0
10001                 general purpose register 1
10010                 …
10011                 reserved
10100                 IV0
10101                 IV1
10110                 IV2
10111                 IV3
11000                 hash result count
11001                 ARC4 IJ-pointer
11010                 ARC4 state record. The length is “don’t care” and is forced to 64 (32-bit words). The offset is “don’t care” and is ignored: the external DMA address is the ARC4-state pointer from the context record.
11011                 from token (see U, bit 14)
11100                 hash result digest (length can be 4, 5, 8 or 16; see note 1)
11101                 reserved
11110                 reserved
11111                 reserved

Note 1: If the selected origin field is hash result digest (11100), a length of 16 exceeds the width of the length field. To achieve an update with length 16 (for SHA-512), a length of ‘0’ must be selected in the instruction. This results in a transfer length of 16 (32-bit) words.

D.5.9 Bypass Token Data – Special Instruction

It is possible to append token data to the result token without observation by the packet engine.
This special instruction can be used to bypass data from the input token through to the result token. If used, it must always be the last instruction. This data stream must start with four bits set to 1 (the opcode). As a result, only 28 bits are available in the first dword. The maximum length is four dwords, including the opcode bits.

BYPASS_TOKEN_DATA:
Bits    31:28   27:00
Field   opcode  data (28 bits only)
Value   1111    ----------------------------
The first dword may be followed by optional data, a maximum of 3 dwords beyond the initial 28 bits.

D.6 Result Token Definition

A result token is generated by the Crypto Packet Processor for every input token processed. The result token contains the following fields resulting from packet processing.

[Result token layout, 32-bit dwords. The token comprises: error bits E14 through E0 (bits 31:17) and the result packet length (bits 16:0); the bypass data length, E15, and the packet info fields H, hash bytes, B, C, N, and L, with the remaining bits reserved; the output packet pointer; the pad length and next header field (IPSec padding only), with the remaining bits reserved; and optional bypass token data (maximum of 4 words). See Table D-7.]

Two errors in the result token behave differently in the Crypto Packet Processor configurations:
• In the standard Crypto Packet Processor, error E0 occurs when either the pre-processor detects a wrong length (instruction lengths versus packet length) or the input DMA fetch reports an error. The latter cannot happen in the Crypto Packet Processor; however, a similar case is detected and reports the same error instead. If data_in_done is asserted before the number of written words matches the packet length field from the input token, the Crypto Packet Processor generates an E0 error.
• An E15 error cannot occur in the Crypto Packet Processor, and is therefore always 0.

Note: Error E14 should never occur in the Crypto Packet Processor; this bit indicates a situation where the timeout counter “fires”. The timeout counter runs if no data movement within the Crypto Packet Processor is detected and fires after this situation has persisted for a fixed number of clock cycles (approximately 1000 clock cycles). The actual number of clock cycles depends on the Crypto Packet Processor core configuration.

Table D-7. Result Token Definition
Bits    Name                  Description
31      E14                   Error Codes. Table D-8 describes the different error codes returned by the Crypto Packet Processor. It is possible to have multiple error codes in one packet. Note that if one or more fatal errors (errors highlighted in yellow) occur, the packet will not be processed correctly and should be dropped. A fatal error only occurs if the input token or context record is incorrect.
30      E13
29      E12
28      E11
27      E10
26      E9
25      E8
24      E7
23      E6
22      E5
21      E4
20      E3
19      E2
18      E1
17      E0
16:00   Result Packet Length  The result packet length equals the length of the packet that is written out of the Crypto Packet Processor, not including the appended result fields (see the ‘packet info fields’ in the result token).
4       E15                   Error Codes. See the description of E14 through E0 above.

Table D-8. Error Codes
Error Code  Description
E0          Packet length error: token instructions versus input or input DMA fetch.
E1          Token error, unknown token command/instruction.
E2          Token contains too much bypass data.
E3          Cryptographic block size error (ECB, CBC).
E4          Hash block size error (basic hash only).
E5          Invalid command/algorithm/mode/combination.
E6          Prohibited algorithm.
E7          Hash input overflow (basic hash only).
E8          TTL / HOP-limit underflow.
E9          Authentication failed.
E10         Sequence number check failed / Sequence number roll-over detected.
E11         SPI check failed.
E12         Checksum incorrect.
E13         Pad verification failed.
E14         Time-out - FATAL ERROR (see note below).
E15         Output DMA error.

Packet Info Fields – (H, hash bytes, B, C, N, and L)

The following fields are collectively called the Packet Info Fields: H, hash bytes, B, C, N, and L. If a packet exceeds 1792 bytes (the output packet buffer size: 2048 minus 256), the result data must be appended to the output packet in the output buffer before the packet is actually written out. An example is updating the IPv4 header fields (length field, protocol, and checksum) after stripping the padding. For each header field, one dword is appended to the packet. Possible header fields are:
• Result packet length (IPv4 inbound) or payload length (IPv6 inbound).
• Next header field, retrieved from de-padding (IPv4 or IPv6 inbound).
• A re-calculated checksum after replacing the length and next header (IPv4 inbound).
• The hash result, with the number of bytes indicated by the hash bytes field (AH outbound).
• Generic byte(s).

The “packet info fields” comprise the following fields:
• H: Hash byte(s) appended (as a result of an insert_result instruction); the number of appended bytes is indicated by the hash bytes field.
• hash bytes: The number of appended hash bytes. Refer to the insert_result instruction in “IRR Instruction Example (Insert Hash Result Operation)” on page 414.
• B: generic byte(s) appended (as a result of a REPLACE_BYTE instruction).
• C: checksum appended (as a result of an insert_result instruction).
• N: next header field appended (as a result of an insert_result instruction). • L: length field appended (as a result of an insert_result instruction). bypass data This field indicates the length of the result token bypass data in dwords. output packet pointer This is a direct copy of the ‘output packet pointer’ from the input token. Refer to “Output Packet Pointer (token dword [2], Required)” on page 391 for details. Tile Processor and I/O Device Guide for the TILE-Gx Family of Processors Tilera Confidential — Subject to Change Without Notice 431 Appendix D: Inline Packet Engine next header The next header field contains the next header result value intended for updating the IP header, specifically, the IPv4 ‘protocol’ field or IPv6 next header field. This value is retrieved during de-padding. The next header field is only applicable to IPSec padding; otherwise, this field is set to 0 (8’h00). pad length The pad length field contains the number of detected (and removed) padding bytes. Applicable padding types are PKCS#7, RTP, IPSec, TLS and SSL. Otherwise, this field is set to 0 (8’h00). D.7 Pre and Post-Processing by Host Software D.7.1 Preprocessing Incoming packet must be preprocessed in order to generate a proper token for the Crypto Packet Processor. During preprocessing the following operations should be considered and executed when applicable. 432 • FCS: Verify the 32 bit CRC of Ethernet frame. • Ethernet hdr: Assure that the destination MAC address in the Ethernet header matches with the device’s MAC address. • Ethernet hdr: Assure that the protocol type specified in the Ethernet header is either IPv4 (0x0800) or IPv6 (0x86dd). • Ethernet hdr: Store the size of the Ethernet header in order to insert the exact Ethernet length when removing the Ethernet header later. • IPv4 hdr: If the Ethernet Type field indicates IPv4, verify that the version field in the IPv4 hdr has the value 4. 
If it is not 4, drop the packet and do not send it to the Crypto Packet Processor, since the packet is obviously corrupt.
• IPv4 hdr: Examine the IHL field to determine if IPv4 options are available.
• IPv4 hdr (optional): For Explicit Congestion Notification (ECN) the ‘Type of Service’ field could be copied from the inner to the outer header (and vice versa) for tunnel modes.
• IPv4 hdr: The packet length field should be stored in order to pass proper payload lengths to several Crypto Packet Processor token commands.
• IPv4 hdr: The protocol field should be stored for insertion into IPv4chksum commands.
• IPv4 hdr: The checksum should be validated before passing the packet to the Crypto Packet Processor. The IPv4chksum command can update the checksum, but does not verify its correctness.
• IPv4 hdr (AH transport only): Determine which options are mutable according to AH.
• IPv6 hdr: If the Ethernet Type field indicates IPv6, verify that the version field in the IPv6 hdr has the value 6. If it is not 6, drop the packet and do not send it to the Crypto Packet Processor, since the packet is obviously corrupt.
• IPv6 hdr (optional): For Explicit Congestion Notification (ECN) the Traffic Class field could be copied from the inner to the outer header (and vice versa) for tunnel modes.
• IPv6 hdr: The payload length field should be stored in order to pass proper payload lengths to several Crypto Packet Processor token commands.
• IPv6 hdr: The next header field should be stored for insertion into the IPv6 command and/or insert commands.

For an extensive set of example tokens, refer to Appendix E: “Inline Packet Engine — Token Examples” on page 491.

D.7.2 Post-Processing

D.7.2.1 Result Token
After the Crypto Packet Processor finishes processing a token, a result token is generated. If no errors occurred during processing, output data will also be available.
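As an illustration of the preprocessing checks listed in D.7.1, the following is a minimal host-side sketch in Python. It is not part of any Tilera API: the names preprocess and ipv4_header_checksum_ok and the keys of the returned dictionary are hypothetical. It assumes an untagged Ethernet II frame (14-byte header) whose FCS has already been verified and stripped, and it does not cover the AH mutable-option analysis.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_IPV6 = 0x86DD
ETH_HDR_LEN = 14  # assumption: untagged Ethernet II frame, FCS already checked and stripped


def ipv4_header_checksum_ok(header: bytes) -> bool:
    """Return True if the one's-complement sum over the whole header is 0xFFFF."""
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
    while total >> 16:  # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF


def preprocess(frame: bytes, device_mac: bytes):
    """Return a dict of stored header values, or None if the packet must be dropped."""
    # Ethernet hdr: destination MAC must match the device's MAC address.
    if len(frame) < ETH_HDR_LEN or frame[0:6] != device_mac:
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    ip = frame[ETH_HDR_LEN:]
    info = {"eth_hdr_len": ETH_HDR_LEN}  # stored for later header removal

    if ethertype == ETHERTYPE_IPV4:
        if len(ip) < 20 or ip[0] >> 4 != 4:  # version field must be 4
            return None
        ihl_bytes = (ip[0] & 0x0F) * 4  # IHL counts 32-bit words
        if ihl_bytes < 20 or len(ip) < ihl_bytes:
            return None
        if not ipv4_header_checksum_ok(ip[:ihl_bytes]):
            return None
        info.update(
            version=4,
            has_options=ihl_bytes > 20,
            tos=ip[1],
            length=struct.unpack("!H", ip[2:4])[0],  # total packet length
            protocol=ip[9],
        )
        return info

    if ethertype == ETHERTYPE_IPV6:
        if len(ip) < 40 or ip[0] >> 4 != 6:  # version field must be 6
            return None
        info.update(
            version=6,
            traffic_class=((ip[0] & 0x0F) << 4) | (ip[1] >> 4),
            length=struct.unpack("!H", ip[4:6])[0],  # payload length
            next_header=ip[6],
        )
        return info

    return None  # neither IPv4 nor IPv6
```

A caller would pass the stored length and protocol (or next header) values on to whatever code builds the input token; a None return corresponds to the "drop and do not send" cases in the list.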

D.7.2.2 Appended Data
When the Crypto Packet Processor has finished processing a token, data is sometimes appended to the packet. When there is appended data, the result token provides information on what is appended; however, it does not indicate the order in which the data is appended. Three instructions cause different append bits to be set in the result token. These are:
• The RESULT instruction with L, NH, CS bits not set -> hash bytes defined
• The RESULT instruction with L, NH and/or CS bits set -> L, NH and/or CS bit set
• The REPLACE instruction with undefined data to append -> B bit set

Data is appended to the output data in the order in which the instructions were provided in the token. For the RESULT instruction with L, NH and/or CS bits set, the appended data order is always Length-NH-checksum.

D.7.2.3 Suggested Post-Processing Operations
During post-processing the following operations should be considered and executed when applicable.
• Result token: verify that no error bits are set.
• Result token: test if Length (L bit) has been appended. If it has been appended, replace the length in the IPv4 / IPv6 header with the appended length. In the case of IPv6, bear in mind that the length in the IPv6 header is the payload length and not the packet length.
• Result token: test if the next header (NH bit) has been appended. If so, replace the protocol field in the IPv4 header or the next header field in the IPv6 header.
• Result token: test if the checksum bit has been set and therefore a 16-bit checksum has been appended. The checksum should replace the checksum in the IPv4 header.
• Result token: next header field: fo