TILE PROCESSOR AND I/O DEVICE GUIDE
FOR THE TILE-GX FAMILY OF PROCESSORS
RELEASE 1.12
DOC. NO. UG404
OCTOBER 2014
TILERA CORPORATION
Copyright © 2010-2014 Tilera Corporation. All rights reserved. Printed in the United States of America.
No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical,
photocopying, recording, or otherwise, except as may be expressly permitted by the applicable copyright statutes or in writing by the
Publisher.
The following are registered trademarks of Tilera Corporation: Tilera and the Tilera logo.
The following are trademarks of Tilera Corporation: Embedding Multicore, The Multicore Company, Tile Processor, TILE Architecture,
TILE64, TILEPro, TILEPro36, TILEPro64, TILExpress, TILExpress-64, TILExpressPro-64, TILExpress-20G, iMesh, TileDirect,
TILExtreme-Gx, TILExtreme-Gx Duo, TILEmpower, TILEmpower-Gx, TILEmpower-Gx36, TILEmpower-Gx72, TILEncore, TILEncorePro,
TILEncore-Gx, TILEncore-Gx9, TILEncore-Gx16, TILEncore-Gx36, TILEncore-Gx72, TILE-Gx, TILE-Gx9, TILE-Gx16, TILE-Gx36,
TILE-Gx72, TILE-Gx8072, TILE-Gx3000, TILE-Gx5000, TILE-Gx8000, TILE-Gx8009, TILE-Gx8016, TILE-Gx8036, TILE-Gx3036, DDC
(Dynamic Distributed Cache), Multicore Development Environment, Gentle Slope Programming, TMC (Tilera Multicore Components),
hardwall, Zero Overhead Linux (ZOL), MiCA (Multicore iMesh Coprocessing Accelerator), and mPIPE (multicore Programmable
Intelligent Packet Engine). All other trademarks and/or registered trademarks are the property of their respective owners.
Third-party software: The Tilera IDE makes use of the BeanShell scripting library. Source code for the BeanShell library can be found at the
BeanShell website (http://www.beanshell.org/developer.html).
This document contains advance information on Tilera products that are in development, sampling, or initial production phases. The
information and specifications contained herein are subject to change without notice at the discretion of Tilera Corporation.
No license, express or implied, by estoppel or otherwise, to any intellectual property is granted by this document. Tilera disclaims any
express or implied warranty relating to the sale and/or use of Tilera products, including liability or warranties relating to fitness for a
particular purpose, merchantability or infringement of any patent, copyright or other intellectual property right.
Products described in this document are NOT intended for use in medical, life support, or other hazardous uses where malfunction could
result in death or bodily injury.
THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN “AS IS” BASIS. Tilera assumes no liability for damages
arising directly or indirectly from any use of the information contained in this document.
Publishing Information:
Document number: UG404
Release: 1.12
Date: 15 October 2014
Contact Information:
Tilera Corporation
Information info@tilera.com
Web Site http://www.tilera.com
Contents
PREFACE
About this Manual .................................................................................................................................. xxi
Intended Audience ................................................................................................................................. xxi
Manual Contents Description .............................................................................................................. xxi
Related Documents .............................................................................................................................. xxiii
Technical or Customer Support ........................................................................................................ xxiii
Product Information ............................................................................................................................ xxiii
Notation Conventions ......................................................................................................................... xxiii
Conventions for Register Descriptions .............................................................................................xxiv
Conventions for Processor Families ............................................................................................................. xxiv
Byte and Bit Order .......................................................................................................................................... xxiv
Reserved Fields ................................................................................................................................................. xxv
Numbering ........................................................................................................................................................ xxv
CHAPTER 1 I/O DEVICE INTRODUCTION
1.1 Overview ................................................................................................................................................ 1
1.1.1 Tile-to-Device Communication ................................................................................................................. 1
1.1.2 Coherent Shared Memory .......................................................................................................................... 2
1.1.3 Device Protection ........................................................................................................................................ 2
1.1.4 Interrupts ...................................................................................................................................................... 2
1.1.5 Device Discovery ......................................................................................................................................... 3
1.1.6 Common Registers ...................................................................................................................................... 3
CHAPTER 2 TILE PROCESSOR
2.1 System Architecture Overview .......................................................................................................... 7
2.2 Memory Architecture .......................................................................................................................... 8
2.3 Memory Addressing ............................................................................................................................ 9
2.3.1 TLB Management ........................................................................................................................................ 9
2.3.1.1 TLB Miss Handling ....................................................................................................................... 24
2.4 Memory Consistency Model ............................................................................................................ 26
2.4.1 Overview .................................................................................................................................................... 26
2.5 TILE-Gx Page Attribute Transitions and Cache Flushes ........................................................... 28
2.6 Protection ............................................................................................................................................. 29
2.6.1 Levels of Protection ................................................................................................................................... 29
2.6.2 Protected Resources .................................................................................................................................. 29
2.7 Interrupt Model ...................................................................................................................................29
2.7.1 Introduction ................................................................................................................................................ 29
2.7.1.1 Interrupt/Exception State ............................................................................................................ 30
2.7.1.2 Nested Interrupts/Exceptions ..................................................................................................... 31
2.7.1.3 Interrupt Traits ............................................................................................................................... 31
2.7.1.4 Interrupt Masks .............................................................................................................................. 32
2.7.1.5 INTCTRL and Protection of Interrupt Masks ............................................................................ 32
2.7.1.6 VLIW and Interrupts ..................................................................................................................... 33
2.7.2 Interrupt and Exception List .................................................................................................................... 34
2.7.3 Interrupt State, Control Registers, Double Faults, and IRET .............................................................. 35
2.7.3.1 Interrupt State and Control Registers ......................................................................................... 35
2.7.3.2 Double Faults ................................................................................................................................. 39
2.7.3.3 IRET ................................................................................................................................................. 40
2.7.4 Interprocessor Interrupt (IPI) .................................................................................................................. 40
2.7.5 Distributed Interrupt Processing ............................................................................................................ 40
2.7.6 Proxying Interrupts ................................................................................................................................... 41
2.7.7 Lower Protection Level Interrupts .......................................................................................................... 41
2.7.8 Downcalls ................................................................................................................................................... 41
2.8 Software-Visible Dynamic Networks ............................................................................................44
2.8.1 Overview .................................................................................................................................................... 44
2.8.1.1 Register Mapping and Interlock .................................................................................................. 44
2.8.1.2 Routing ............................................................................................................................................ 45
2.8.1.3 Demultiplexing .............................................................................................................................. 46
2.8.1.4 Receive-Side Buffering .................................................................................................................. 47
2.8.2 Ordering ...................................................................................................................................................... 47
2.8.2.1 Packet Format ................................................................................................................................. 47
2.8.3 Network Hardwall .................................................................................................................................... 48
2.8.4 Interrupts .................................................................................................................................................... 48
2.8.5 Deadlocks ................................................................................................................................................... 48
2.9 Special Purpose Registers (SPRs) ....................................................................................................49
2.10 Performance Counters / System Diagnostics ..............................................................................49
2.10.1 In-Tile System Devices ............................................................................................................................ 49
2.10.1.1 Tile Timer and AUX_TILE_TIMER ........................................................................................... 49
2.10.1.2 Cycle Counter ............................................................................................................................... 49
2.10.2 Events ........................................................................................................................................................ 49
2.10.3 Counters .................................................................................................................................................... 50
2.10.4 Watch Registers ....................................................................................................................................... 50
2.10.5 Pass SPR .................................................................................................................................................... 50
2.10.6 Broadcast Networks ................................................................................................................................ 50
2.10.7 System Software Debug .......................................................................................................................... 51
2.10.7.1 Tile Debug Port ............................................................................................................................ 51
2.10.7.2 Quiesce .......................................................................................................................................... 56
2.11 Boot Processes and Data Format ....................................................................................................56
2.11.1 Boot Flow .................................................................................................................................................. 56
2.11.2 Chip Modes and Reset Behavior ........................................................................................................... 57
2.11.3 Boot FIFO .................................................................................................................................................. 58
CHAPTER 3 DOUBLE DATA RATE SDRAM (DDR3) INTERFACE
3.1 Overview .............................................................................................................................................. 59
3.2 Interfaces .............................................................................................................................................. 60
3.2.1 DDR3 Interface .......................................................................................................................................... 60
3.2.2 Network Interface ..................................................................................................................................... 60
3.3 Data Flows ........................................................................................................................................... 60
3.3.1 QDN Memory Read Request Flow ......................................................................................................... 61
3.3.2 RDN Memory Read Response Flow ....................................................................................................... 61
3.3.3 QDN Memory Write Request Flow ........................................................................................................ 61
3.3.4 RDN Memory Write Response Flow ...................................................................................................... 61
3.3.5 Non-Cacheline Write Flow and Masked Write Flow ........................................................................... 61
3.4 Ordering ............................................................................................................................................... 62
3.4.1 Out of Order Dispatch .............................................................................................................................. 62
3.4.2 Out of Order Response ............................................................................................................................. 62
3.5 Addressing ........................................................................................................................................... 62
3.5.1 Memory Controller Striping .................................................................................................................... 63
3.5.2 DDR Address Mapping (from Memory Address Mapping) .............................................................. 63
3.5.3 Memory Rank/Bank Hashing ................................................................................................................. 64
3.5.4 Logical Rank and Physical Rank Mapping ........................................................................................... 64
3.6 Scheduler ............................................................................................................................................. 64
3.6.1 Memory Page Management Policy ......................................................................................................... 64
3.6.2 Memory Request Reordering .................................................................................................................. 65
3.6.3 Memory Command Reordering .............................................................................................................. 65
3.7 DIMM Support ................................................................................................................................... 65
3.7.1 Serial Presence-Detect EEPROM Support ............................................................................................. 66
3.7.2 Temperature Sensor .................................................................................................................................. 66
3.7.3 Address/Command Parity ...................................................................................................................... 66
3.7.4 RDIMM Control Word Access ................................................................................................................ 66
3.7.5 Memory PHY Training ............................................................................................................................ 66
CHAPTER 4 PCIE CONTROLLER ARCHITECTURE (TRIO)
4.1 Overview .............................................................................................................................................. 67
4.1.1 Communication and Data Transfer ........................................................................................................ 68
4.1.2 PHY Sharing ............................................................................................................................................... 68
4.2 MMIO Interface .................................................................................................................................. 69
4.3 PIO Communication .......................................................................................................................... 70
4.3.1 Memoryless Operation ............................................................................................................................. 70
4.3.2 Ordering ..................................................................................................................................................... 71
4.4 Push DMA ........................................................................................................................................... 71
4.4.1 Descriptors ................................................................................................................................................. 71
4.4.2 Request Partitioning ................................................................................................................................. 73
4.4.3 Notification and Flow Control ................................................................................................................ 73
4.4.3.1 Descriptor Rings Slot Available Notification ............................................................................. 73
4.4.3.2 Transaction Complete Notification ............................................................................................. 73
4.4.3.3 PCI System Notification ................................................................................................................ 73
4.4.4 Flush/Fence ................................................................................................................................................ 73
4.5 Pull-DMA .............................................................................................................................................74
4.5.1 Pull DMA Notifications and Flow Control ............................................................................................ 75
4.5.2 Descriptor Rings Slot Available Notification ........................................................................................ 75
4.5.3 Transaction Complete Notification ......................................................................................................... 75
4.5.4 Request Tracker ......................................................................................................................................... 75
4.6 Flush/Fence ..........................................................................................................................................75
4.7 Address Translation ...........................................................................................................................75
4.7.1 I/O MMU ................................................................................................................................................... 76
4.8 Ingress Mapping Regions .................................................................................................................77
4.8.1 Tile Map Memory Regions ....................................................................................................................... 78
4.8.1.1 MAP-MEM Interrupts ................................................................................................................... 78
4.8.1.2 Map-Region Ordering ................................................................................................................... 79
4.8.2 Scatter Queue Regions .............................................................................................................................. 80
4.8.3 Boot and Rshim Regions .......................................................................................................................... 81
4.8.4 Map Fence ................................................................................................................................................... 81
4.9 Panic Mode ...........................................................................................................................................82
4.10 Connection to mPIPE .......................................................................................................................82
4.11 Deadlock .............................................................................................................................................84
CHAPTER 5 PCIE MAC INTERFACE
5.1 Introduction .........................................................................................................................................85
5.2 Register Spaces ....................................................................................................................................86
5.2.1 Type-0/1 and Virtual Function Configuration Space .......................................................................... 87
5.3 Port Configuration ..............................................................................................................................88
5.4 IO Address Mapping .........................................................................................................................88
5.4.1 Boot and Diagnostics Access ................................................................................................................... 88
5.5 Interrupts ..............................................................................................................................................88
5.6 Power Management ............................................................................................................................88
5.7 Link Down Handling .........................................................................................................................89
5.8 SERDES Configuration .....................................................................................................................89
5.9 Streaming Interface ............................................................................................................................89
5.9.1 Packetization .............................................................................................................................................. 90
5.9.2 Interrupts .................................................................................................................................................... 90
5.9.3 Flow Control .............................................................................................................................................. 90
CHAPTER 6 MPIPE ARCHITECTURE
6.1 Overview ..............................................................................................................................................91
6.1.1 Glossary ...................................................................................................................................................... 91
6.1.2 PHY and DMA Sharing ............................................................................................................................ 92
6.1.3 Channelization ........................................................................................................................................... 92
6.1.4 Channels vs. Ports ..................................................................................................................................... 93
6.1.5 Priority Queues .......................................................................................................................................... 93
6.1.6 Communication Model ............................................................................................................................ 93
6.2 Ingress Services .................................................................................................................................. 93
6.2.1 Typical Ingress Flow ................................................................................................................................. 93
6.2.2 Buffers ......................................................................................................................................................... 94
6.2.2.1 Buffer Stacks ................................................................................................................................... 95
6.2.2.2 Buffer Chaining .............................................................................................................................. 96
6.2.2.3 Buffer Release ................................................................................................................................. 98
6.2.2.4 Buffer Stack Engine ....................................................................................................................... 99
6.2.3 iDMA Packet Descriptors ....................................................................................................................... 100
6.2.4 Notification Rings ................................................................................................................................... 102
6.2.5 Store-and-Forward vs. Cut-Through .................................................................................................... 103
6.2.6 Classifier ................................................................................................................................................... 103
6.2.6.1 Parallel Processing ....................................................................................................................... 104
6.2.6.2 Cycle Budget ................................................................................................................................ 104
6.2.7 Processor Architecture ............................................................................................................................ 105
6.2.7.1 Header and Descriptor ............................................................................................................... 106
6.2.7.2 Table Lookup ............................................................................................................................... 106
6.2.7.3 Special Registers .......................................................................................................................... 106
6.2.7.4 Hash Accumulator ...................................................................................................................... 107
6.2.7.5 Endianness .................................................................................................................................... 107
6.2.7.6 Header/Descriptor Valid Indicators ........................................................................................ 107
6.2.7.7 Classifier Pipeline ........................................................................................................................ 108
6.2.7.8 Stalls ............................................................................................................................................... 108
6.2.7.9 Persistent State ............................................................................................................................. 109
6.2.7.10 Exceptions ................................................................................................................................... 110
6.2.7.11 Classifier Configuration ........................................................................................................... 110
6.2.7.12 Classifier “Blast” Re/Programming ....................................................................................... 110
6.2.7.13 SPRs ............................................................................................................................................. 112
6.2.7.14 Classifier Tools ........................................................................................................................... 112
6.2.8 iDMA Engine ........................................................................................................................................... 112
6.2.8.1 Temporal Hints for iDMA Writes ............................................................................................. 113
6.2.9 Load Balancer .......................................................................................................................................... 114
6.2.9.1 BucketSTS ..................................................................................................................................... 114
6.2.9.2 Notification Groups ..................................................................................................................... 114
6.2.9.3 Notification Ring Arbitration ..................................................................................................... 115
6.2.9.4 Load Balance Override Flows .................................................................................................... 117
6.2.10 Checksum ............................................................................................................................................... 118
6.2.11 Notification ............................................................................................................................................ 118
6.2.11.1 Tail Pointer Updates – Polling Model .................................................................................... 119
6.2.11.2 Notification Interrupts .............................................................................................................. 119
6.2.11.3 Timestamp and Sequence Number Information .................................................................. 120
6.2.12 Counters ................................................................................................................................................. 120
6.2.13 Software Override Flows ..................................................................................................................... 120
6.2.13.1 Software Classification .............................................................................................................. 120
6.2.13.2 Software Load Balancing .......................................................................................................... 121
6.2.13.3 Software Buffer Management .................................................................................................. 121
6.3 Ingress Channel Flow Control ....................................................................................................... 121
6.4 Packet Drops ...................................................................................................................................... 122
6.4.1 Drop/Truncate: iPkt Full ....................................................................................................................... 122
6.4.2 Drop: Classifier Cycle-Budget ............................................................................................................... 122
6.4.3 Drop: Classifier Program ........................................................................................................................ 122
6.4.4 Drop: NotifRing Full ............................................................................................................................... 122
6.4.5 Drop: Bucket Count Full ......................................................................................................................... 122
6.4.6 Drop/Truncate: Out of Buffers ............................................................................................................. 123
6.5 Egress Services .................................................................................................................................. 123
6.5.1 Typical Egress Flow ................................................................................................................................ 123
6.5.2 eDMA Packet Descriptors ...................................................................................................................... 124
6.5.2.1 eDMA Descriptor Fetch .............................................................................................................. 125
6.5.2.2 eDMA Descriptor Hunt Mode ................................................................................................... 125
6.5.2.3 Explicit eDMA Descriptor Post .................................................................................................. 126
6.5.2.4 eDMA Descriptor Ring Reordering .......................................................................................... 126
6.5.2.5 Descriptor Prefetch and Memory Ordering ............................................................................. 127
6.5.2.6 Descriptor-Write and Descriptor-Post Ordering .................................................................... 127
6.5.2.7 Ring to Channel Mapping .......................................................................................................... 127
6.5.2.8 Descriptor Errors .......................................................................................................................... 128
6.5.3 Buffers ....................................................................................................................................................... 128
6.5.3.1 Chaining ........................................................................................................................................ 128
6.5.3.2 Descriptor-Based Gather ............................................................................................................. 128
6.5.3.3 Transaction Sizing and Buffer Offsets ...................................................................................... 129
6.5.3.4 Buffer Release ............................................................................................................................... 129
6.5.3.5 Egress VA Translations ............................................................................................................... 130
6.5.4 eDMA Engine ........................................................................................................................................... 130
6.5.5 ePkt Buffering .......................................................................................................................................... 130
6.5.6 Notifications ............................................................................................................................................. 130
6.5.6.1 Descriptor Ring Head .................................................................................................................. 130
6.5.6.2 Descriptor Complete Interrupt and Counter ........................................................................... 131
6.5.7 Checksum ................................................................................................................................................. 131
6.5.7.1 eDMA Checksum Buffer Limitations ....................................................................................... 131
6.5.8 Egress Picker ............................................................................................................................................ 132
6.5.8.1 Egress Priority Arbitration ......................................................................................................... 132
6.5.8.2 Egress Priority Flow Control ...................................................................................................... 132
6.5.9 Special Flows ............................................................................................................................................ 133
6.5.9.1 NoSend Option ............................................................................................................................ 133
6.5.9.2 Size=0 Option ............................................................................................................................... 133
6.5.9.3 eDMA Loopback .......................................................................................................................... 133
6.6 Virtual Memory ................................................................................................................................. 133
6.6.1 I/O TLB Details ....................................................................................................................................... 134
6.7 PA Distribution ................................................................................................................................. 135
6.7.1 Locality Hints ........................................................................................................................................... 135
6.7.2 Pinning ...................................................................................................................................................... 136
6.8 MMIO ................................................................................................................................................. 136
6.8.1 MAC Configuration Registers ............................................................................................................... 136
6.8.2 Service Domains ...................................................................................................................................... 137
6.9 Interrupts ........................................................................................................................................... 138
6.10 UserIO .............................................................................................................................................. 139
6.11 Flush Mechanisms ......................................................................................................................... 139
6.11.1 MMIO Access Drain .............................................................................................................................. 139
6.11.2 NotifRing Drain ..................................................................................................................................... 139
6.11.3 Ingress Channel Drain .......................................................................................................................... 140
6.11.4 EDMA Ring Drain ................................................................................................................................. 140
CHAPTER 7 XAUI MAC INTERFACE
7.1 Introduction ....................................................................................................................................... 143
7.1.1 Features ..................................................................................................................................................... 143
7.2 Register Spaces ................................................................................................................................. 143
7.3 MAC and Channel Mapping ......................................................................................................... 144
7.4 Port Configuration ........................................................................................................................... 144
7.4.1 Lane Sharing with SGMII ....................................................................................................................... 145
7.5 Flow Control ...................................................................................................................................... 145
7.5.1 Priority-Based Flow Control .................................................................................................................. 145
7.6 Interrupts ........................................................................................................................................... 145
7.7 Timestamping and IEEE 1588 ........................................................................................................ 145
7.8 MDIO ................................................................................................................................................. 151
7.9 Statistics ............................................................................................................................................. 151
7.10 Filtering ............................................................................................................................................ 151
7.10.1 Type ID Checking ................................................................................................................................. 152
7.10.2 Broadcast Address ................................................................................................................................ 152
7.10.3 Hash Addressing ................................................................................................................................... 153
7.11 Special Modes ................................................................................................................................. 153
7.11.1 Pass All Frames Mode .......................................................................................................................... 153
7.11.2 Custom Preamble .................................................................................................................................. 153
7.11.3 Short IPG ................................................................................................................................................ 153
7.12 SERDES Control ............................................................................................................................. 154
7.13 LEDs .................................................................................................................................................. 154
CHAPTER 8 SGMII MAC INTERFACE
8.1 Introduction ....................................................................................................................................... 155
8.1.1 Features ..................................................................................................................................................... 155
8.2 Register Spaces ................................................................................................................................. 155
8.3 MAC and Channel Mapping ......................................................................................................... 156
8.4 Port Configuration ........................................................................................................................... 156
8.4.1 Lane Sharing with XAUI ........................................................................................................................ 156
8.5 Flow Control ...................................................................................................................................... 157
8.5.1 Priority-Based Flow Control .................................................................................................................. 157
8.6 Interrupts ............................................................................................................................................ 157
8.7 Timestamping and IEEE 1588 ......................................................................................................... 157
8.8 MDIO .................................................................................................................................................. 158
8.9 10/100Mbps Support ......................................................................................................................... 158
8.10 Half-Duplex Support ..................................................................................................................... 158
8.11 Energy Efficient Ethernet Support (IEEE 802.3az) .................................................................... 158
8.11.1 802.3az Operation .................................................................................................................................. 158
8.11.2 LPI Operation in the MAC ................................................................................................................... 159
8.12 PCS Auto-Negotiation ................................................................................................................... 159
8.12.1 PCS Collision Detect and Carrier Sense ............................................................................................. 160
8.12.2 Link Status .............................................................................................................................................. 160
8.13 Statistics ............................................................................................................................................160
8.14 Filtering ............................................................................................................................................160
8.14.1 Type ID Checking .................................................................................................................................. 161
8.14.2 Broadcast Address ................................................................................................................................. 162
8.14.3 Hash Addressing ................................................................................................................................... 162
CHAPTER 9 TILE-GX INTERLAKEN INTERFACE
9.1 Overview ............................................................................................................................................ 163
9.1.1 Channel Mapping .................................................................................................................................... 163
9.2 TX Interface ........................................................................................................................................ 163
9.2.1 Burst Scheduler ........................................................................................................................................ 164
9.2.2 Packet vs. Burst ........................................................................................................................................ 164
9.3 RX Interface .......................................................................................................................................164
9.4 Flow Control ......................................................................................................................................164
9.4.1 Link Level TX Flow Control .................................................................................................................. 165
9.4.2 Channel-Based Flow Control ................................................................................................................. 165
9.4.3 Link Level RX Flow Control .................................................................................................................. 165
9.4.4 Out-of-Band Flow Control ..................................................................................................................... 165
9.5 Statistics .............................................................................................................................................. 165
9.6 Initialization ...................................................................................................................................... 166
9.7 Error Handling .................................................................................................................................. 166
CHAPTER 10 USB INTERFACE
10.1 Overview .......................................................................................................................................... 167
10.2 External I/O Interface ..................................................................................................................... 168
10.3 Mesh Interface ................................................................................................................................. 168
10.3.1 MMIO Interface ..................................................................................................................................... 168
10.3.2 Memory Access ...................................................................................................................................... 169
10.3.3 Interrupt Interface ................................................................................................................................. 170
10.4 Host Controller ................................................................................................................................170
10.5 Device Endpoint .............................................................................................................................170
10.5.1 Configuration ......................................................................................................................................... 170
10.5.2 MAC Design ........................................................................................................................................... 170
10.5.3 MAC Interrupts ..................................................................................................................................... 171
10.5.3.1 Device Interrupts ....................................................................................................................... 171
10.5.3.2 Endpoint Interrupts ................................................................................................................... 171
10.6 Standalone Device Operation ...................................................................................................... 171
10.6.1 Interface and Endpoint Configuration ............................................................................................... 172
10.6.2 Boot/Debug Interface ........................................................................................................................... 172
10.6.3 Tile-Monitor Interface ........................................................................................................................... 173
CHAPTER 11 COMMON ACCELERATOR INTERFACE (MICA)
11.1 Introduction ..................................................................................................................................... 175
11.2 Overview and Major Functional Blocks .................................................................................... 176
11.2.1 Major Blocks ........................................................................................................................................... 177
11.2.1.1 Mesh Interface ............................................................................................................................ 177
11.2.1.2 MMIO Registers and Context State ......................................................................................... 177
11.2.1.3 Operand Data Specification ..................................................................................................... 181
11.2.1.4 TLB (Translation Lookaside Buffer) ........................................................................................ 183
11.2.1.5 Engine Scheduler ....................................................................................................................... 184
11.2.1.6 Function Specific Engines ......................................................................................................... 184
11.2.1.7 DMA Channels .......................................................................................................................... 184
11.2.1.8 PA to Header Generation ......................................................................................................... 184
11.3 Operation Flow ............................................................................................................................... 184
11.3.1 General Flow .......................................................................................................................................... 184
11.3.2 Tile Interrupts ........................................................................................................................................ 185
11.3.3 Specific Use Examples .......................................................................................................................... 186
11.3.3.1 General Use ................................................................................................................................ 186
11.3.3.2 TLB Miss ..................................................................................................................................... 186
11.3.3.3 Deferred Interrupts ................................................................................................................... 187
11.3.3.4 Pause Context ............................................................................................................................. 187
11.3.3.5 TLB Probe ................................................................................................................................... 188
11.3.3.6 TLB Shootdown ......................................................................................................................... 188
11.3.3.7 Terminate Operation for a Specific Context .......................................................................... 188
CHAPTER 12 CRYPTOGRAPHIC ACCELERATOR INTERFACE
12.1 Engines ............................................................................................................................................. 191
12.2 Schedulers ....................................................................................................................................... 191
12.3 Contexts ............................................................................................................................................ 191
12.4 Engine-Specific Details ................................................................................................................. 191
12.4.1 Memory-to-Memory Copy Engine ..................................................................................................... 191
12.4.1.1 Usage Constraints for the Engine ............................................................................................ 191
12.4.2 Crypto Packet Processor ....................................................................................................................... 192
12.4.2.1 Usage Constraints for the Crypto Packet Processor Engine ............................................... 192
12.4.3 KASUMI and SNOW-3G Engine ........................................................................................................ 193
12.4.3.1 KASUMI Engine ........................................................................................................................ 193
12.4.3.2 SNOW-3G Engine ...................................................................................................................... 194
12.4.3.3 Usage Constraints for the Engine ............................................................................................ 195
12.4.4 Public Key Accelerator Engine ............................................................................................................ 195
12.4.4.1 Descriptor Ring Management .................................................................................................. 196
12.4.4.2 Command Descriptor Contents ............................................................................................... 197
12.4.4.3 Result Descriptor Contents ....................................................................................................... 197
12.4.4.4 Interrupts .................................................................................................................................... 197
CHAPTER 13 COMPRESSION ACCELERATOR INTERFACE
13.1 Overview ..........................................................................................................................................199
13.2 Data Flows ........................................................................................................................................199
13.2.1 Typical Compression Flow .................................................................................................................. 199
13.2.2 Typical Decompression Flow .............................................................................................................. 200
13.3 Compression Engine ...................................................................................................................... 200
13.3.1 Engine Configuration ........................................................................................................................... 200
13.3.2 GZIP Handling ...................................................................................................................................... 201
13.4 Decompression Engine .................................................................................................................. 201
13.4.1 GZIP Handling ...................................................................................................................................... 201
13.5 Memory-to-Memory Copy ............................................................................................................201
13.6 API .....................................................................................................................................................201
13.6.1 Context Registers ................................................................................................................................... 202
13.6.2 Compression/Decompression Engine Registers .............................................................................. 202
13.6.3 Status Registers ...................................................................................................................................... 203
13.6.4 Transaction Size ..................................................................................................................................... 203
13.6.5 Data Expansion Handling .................................................................................................................... 203
13.6.6 Performance Counter ............................................................................................................................ 204
CHAPTER 14 FLEXIBLE I/O INTERFACE
14.1 Overview .......................................................................................................................................... 205
14.2 Virtualization and Protection Support ....................................................................................... 205
14.3 MMIO Register Map ...................................................................................................................... 206
14.4 Interrupts .......................................................................................................................................... 206
14.5 I/O Pin Driver Configuration ....................................................................................................... 206
14.6 I/O Pin Clocking Control .............................................................................................................. 206
14.7 Pin Control and Data Accesses .................................................................................................... 207
14.8 Reset/Initialization ......................................................................................................................... 207
14.9 Performance ..................................................................................................................................... 207
CHAPTER 15 RSHIM INTERFACES
15.1 Level-1 Boot .....................................................................................................................................209
15.2 I/O Discovery ...................................................................................................................................209
15.3 tile-monitor FIFOs ..........................................................................................................................209
15.4 Down-Counters and Watchdog ....................................................................................................210
15.5 Rshim JTAG .....................................................................................................................................210
15.6 Reset Control ...................................................................................................................................210
15.7 Byte Access Interface ......................................................................................................................210
15.8 Remote Interface Access and Device Protection .......................................................................210
CHAPTER 16 UART INTERFACES
16.1 UART Interface ............................................................................................................................... 211
16.1.1 Overview ................................................................................................................................................ 211
16.1.1.1 Protocol Mode ............................................................................................................................ 211
16.1.2 Data Flows .............................................................................................................................................. 213
16.1.2.1 Receiving Data ........................................................................................................................... 213
16.1.2.2 Transmitting Data ...................................................................................................................... 213
16.1.3 Flow Control .......................................................................................................................................... 213
16.1.4 Master Arbitration ................................................................................................................................. 213
16.1.5 8/64 Bits Handling ................................................................................................................................ 214
16.1.5.1 Remote UART Writes ................................................................................................................ 214
16.1.5.2 Remote UART Reads ................................................................................................................ 214
16.1.6 Error Handling and Interrupts ............................................................................................................ 214
16.1.7 UART Controller Registers .................................................................................................................. 214
CHAPTER 17 I2C MASTER INTERFACE
17.1 Overview .......................................................................................................................................... 215
17.1.1 I2C Master Boot Options ...................................................................................................................... 215
17.1.2 Boot ROM Format ................................................................................................................................. 216
17.1.3 Boot Operations ..................................................................................................................................... 217
17.2 Usage Model .................................................................................................................................... 217
17.2.1 Generic Operation ................................................................................................................................. 217
17.2.2 Software Instructions ............................................................................................................................ 220
17.2.3 I2C EEPROM Page Mode ..................................................................................................................... 222
17.2.4 Error Handling and Interrupts ............................................................................................................ 222
17.3 Registers ........................................................................................................................................... 222
CHAPTER 18 I2C SLAVE INTERFACE
18.1 Overview .......................................................................................................................................... 223
18.2 Usage Model .................................................................................................................................... 223
18.2.1 Data Flows .............................................................................................................................................. 223
18.2.2 Direct-Addressing ................................................................................................................................. 224
18.2.3 No-Address Access ............................................................................................................................... 224
18.2.4 8 Bits / 64 Bits Handling ...................................................................................................................... 224
18.2.5 Acknowledge Control ........................................................................................................................... 225
18.2.6 Access Arbitration ................................................................................................................................. 225
18.2.7 Error Handling and Interrupts ............................................................................................................ 225
CHAPTER 19 SPI INTERFACE
19.1 Overview .......................................................................................................................................... 227
19.1.1 Boot Options .......................................................................................................................................... 227
19.1.2 Boot ROM Format ................................................................................................................................. 227
19.2 Usage Model .................................................................................................................................... 229
19.2.1 Boot Operation ....................................................................................................................................... 229
19.2.2 SPI Flash Operations ............................................................................................................................. 229
19.2.2.1 SPI Flash Instructions ................................................................................................................ 229
19.2.2.2 SPI Configurable Instruction Sets ............................................................................................ 230
19.2.2.3 SPI Flash Unknown Instruction ............................................................................................... 230
19.2.2.4 SPI Flash Deep Power-Down ................................................................................................... 230
19.2.2.5 SPI Flash Write In-Progress ...................................................................................................... 230
19.2.2.6 SPI Flash Write Protection ........................................................................................................ 230
19.2.2.7 SPI Flash Page Mode ................................................................................................................. 231
19.2.2.8 SPI Flash Interface ...................................................................................................................... 231
19.2.2.9 Software Command Sequences to Execute an SPI Flash Instruction ................................. 231
19.2.2.10 Interface Timing ....................................................................................................................... 233
19.2.3 Rshim Interface ...................................................................................................................................... 234
19.2.3.1 Rshim Register Interface ........................................................................................................... 234
19.2.3.2 Rshim Host Interface ................................................................................................................. 234
19.2.3.3 Error Handling and Interrupts ................................................................................................ 234
APPENDIX A JTAG INTERFACE
APPENDIX B CLASSIFIER INSTRUCTIONS AND SPRS
B.1 Classifier Instructions .....................................................................................................................237
B.1.1 Arithmetic Instructions .......................................................................................................................... 238
B.1.2 Comparison Instructions ....................................................................................................................... 240
B.1.3 Control Instructions ................................................................................................................................ 242
B.1.4 Logical Instructions ................................................................................................................................ 243
B.1.5 Miscellaneous Instructions .................................................................................................................... 245
B.2 Registers .............................................................................................................................................246
B.2.1 Register Summary ................................................................................................................................... 246
B.2.2 Register Definitions ................................................................................................................................ 247
APPENDIX C MISCELLANEOUS ACCELERATOR SPECIFICATIONS
C.1 SNOW-3G Engines ..........................................................................................................................255
C.1.1 Specification Summary .......................................................................................................................... 255
C.1.2 Performance ............................................................................................................................................. 256
C.1.2.1 Introduction ................................................................................................................................. 256
C.1.3 Functional Description ........................................................................................................................... 257
C.1.3.1 SNOW Key Stream Generator ................................................................................................... 257
C.1.4 Feedback Logic and XOR ...................................................................................................................... 258
C.1.5 Examples .................................................................................................................................................. 259
C.1.6 Operations ............................................................................................................................................... 260
C.1.6.1 General Operations ..................................................................................................................... 260
C.1.6.2 Encryption Modes: UEA2 / 128-EEA1 .................................................................................... 260
C.2 KASUMI Engines ............................................................................................................................261
C.2.1 Introduction ............................................................................................................................................. 261
C.2.1.1 Specification Summary .............................................................................................................. 261
C.2.2 KASUMI Engine Functional Description ............................................................................................ 261
C.2.2.1 General Processing ..................................................................................................................... 261
C.2.2.2 Examples ...................................................................................................................................... 262
C.3 Packet Processor — Programming ............................................................................................... 264
C.3.1 Introduction ............................................................................................................................................. 264
C.3.1.1 Purpose ......................................................................................................................................... 264
C.3.1.2 Scope ............................................................................................................................................. 264
C.3.1.3 Abbreviation and Definitions ................................................................................................... 264
C.3.1.4 Data Flow Table .......................................................................................................................... 264
C.3.2 ARC4 Algorithm ..................................................................................................................................... 265
C.3.3 AES-CCM for Basic Operations and IPSec Protocols ........................................................................ 266
C.3.3.1 Introduction ................................................................................................................................. 266
C.3.3.2 Authentication ............................................................................................................................. 267
C.3.3.3 Encryption .................................................................................................................................... 268
C.3.3.4 Implementation ........................................................................................................................... 269
C.3.3.5 Basic Operation ........................................................................................................................... 270
C.3.3.6 ESP ................................................................................................................................................ 276
C.3.4 AES-GMAC/AES-GCM for Basic Operations and IPSec Protocols ............................................... 279
C.3.4.1 Introduction ................................................................................................................................. 279
C.3.4.2 Basic Operation ........................................................................................................................... 279
C.3.4.3 IPSec .............................................................................................................................................. 281
C.3.5 SRTP/SRTCP Protocols ......................................................................................................................... 289
C.3.5.1 Introduction ................................................................................................................................. 289
C.3.5.2 Packet Format .............................................................................................................................. 289
C.4 Context Control Words ................................................................................................................... 290
C.4.0.1 Outbound Processing ................................................................................................................. 292
C.4.0.2 Inbound Processing .................................................................................................................... 293
C.4.1 MACsec Protocol .................................................................................................................................... 295
C.4.1.1 Introduction ................................................................................................................................. 295
C.4.1.2 Packet Format .............................................................................................................................. 296
C.4.1.3 Context Control Words .............................................................................................................. 297
C.4.1.4 Outbound Processing ................................................................................................................. 299
C.4.1.5 Inbound Processing .................................................................................................................... 301
C.4.2 DTLS Protocol ......................................................................................................................................... 303
C.4.2.1 Introduction ................................................................................................................................. 303
C.4.2.2 Supported Features .................................................................................................................... 303
C.4.2.3 Packet Format .............................................................................................................................. 304
C.4.2.4 Context Control Words .............................................................................................................. 304
C.4.2.5 Outbound Processing ................................................................................................................. 306
C.4.2.6 Inbound Processing .................................................................................................................... 309
C.4.3 SSL/TLS Protocol ................................................................................................................................... 312
C.4.3.1 Introduction ................................................................................................................................. 312
C.4.3.2 Supported Features .................................................................................................................... 313
C.4.3.3 Packet Format .............................................................................................................................. 313
C.4.3.4 Context Control Words .............................................................................................................. 314
C.4.3.5 SSL MAC ...................................................................................................................................... 315
C.4.3.6 Outbound Processing ................................................................................................................. 316
C.4.3.7 Inbound Processing .................................................................................................................... 319
C.5 Public Key Accelerator (PKA) .......................................................................................................322
C.5.1 PKA Firmware Architecture Overview ............................................................................................... 322
C.5.2 Command and Vector Copy and Zeroization .................................................................................... 324
C.5.3 PKI Command Interface ........................................................................................................................ 326
C.5.4 Main PKI Command Interface .............................................................................................................. 326
C.5.4.1 Descriptor Ring Management ................................................................................................... 326
C.5.4.2 Descriptor Ring Control/Status Words ................................................................................... 326
C.5.5 PKI Command and Result Descriptors ............................................................................................... 332
C.5.5.1 Command Descriptor Contents ................................................................................................ 332
C.5.5.2 Result Descriptor Contents ........................................................................................................ 334
C.5.5.3 PKI Command/Result Specifics (Firmware Dependent) ..................................................... 337
C.5.5.4 Restrictions on PKA Operations ............................................................................................... 353
C.5.6 PKI Key Decrypt Key Management Interface .................................................................................... 359
C.5.6.1 AES Byte Order Example ........................................................................................................... 361
C.5.6.2 PKI Key Decrypt Keys Storage (PKI_KDK_0_[0:7] … _3_[0:7]) ........................................... 362
C.5.6.3 PKI Key Decrypt IVs Storage (PKI_KD_IV_0_[0:3] … _3_[0:3]) .......................................... 364
C.5.6.4 PKI Key Decrypt CTR Mode Increment Storage (PKI_KD_INCR_0 … _3) ....................... 366
C.5.6.5 PKI Key Decrypt Key Control Words ...................................................................................... 368
C.5.7 PKI Engine Boot-Up and Internal Error Reporting ........................................................................... 369
C.6 Conventions Used in this Manual ................................................................................................370
C.6.1 Register Information .............................................................................................................................. 370
APPENDIX D INLINE PACKET ENGINE
D.1 Crypto Packet Processor Processing Overview .........................................................................371
D.1.1 Crypto Packet Processor Terms ........................................................................................................... 371
D.1.1.1 Tokens .......................................................................................................................................... 372
D.1.1.2 Context ......................................................................................................................................... 372
D.2 Configuring the Crypto Packet Processor ..................................................................................372
D.2.1 Enabling Protocol and Algorithm Support ........................................................................................ 372
D.2.2 Context Fetch Modes ............................................................................................................................. 372
D.2.3 Packet Processing Modes ...................................................................................................................... 373
D.3 Pseudo Random Number Generator ...........................................................................................375
D.3.1 Purpose .................................................................................................................................................... 375
D.3.2 Architecture ............................................................................................................................................. 376
D.3.3 Functional Description .......................................................................................................................... 376
D.3.4 Generation of DT .................................................................................................................................... 377
D.3.5 Generation of Keys ................................................................................................................................. 378
D.3.6 Performance ............................................................................................................................................ 378
D.4 Input Token Definition ..................................................................................................................379
D.4.1 Introduction ............................................................................................................................................ 379
D.4.2 Input Token Diagram ............................................................................................................................ 379
D.4.2.1 Input Token Header ................................................................................................................... 379
D.5 Processing Instructions ..................................................................................................................392
D.5.1 Instruction Types .................................................................................................................................... 392
D.5.1.1 Operational Data Instructions (Type 1) ................................................................................... 392
D.5.1.2 IP Header Instructions (Type 2) ............................................................................................... 392
D.5.1.3 Post-Process Instructions (Type 3) ........................................................................................... 393
D.5.1.4 Result Instructions (Type 4) ...................................................................................................... 393
D.5.1.5 Context Control Instructions (Type 5) ..................................................................................... 393
D.5.1.6 Special Instructions (Type 6) ..................................................................................................... 393
D.5.2 Instruction Sequencing .......................................................................................................................... 393
D.5.2.1 Sequencing Rules ........................................................................................................................ 393
D.5.3 Instruction Format ................................................................................................................................. 394
D.5.4 Operational Data Instructions .............................................................................................................. 396
D.5.4.1 Direction Instruction .................................................................................................................. 396
D.5.4.2 PRE_CHECKSUM Instruction .................................................................................................. 397
D.5.4.3 INSERT Instruction .................................................................................................................... 398
D.5.4.4 INSERT Instruction Example – NOP ....................................................................................... 401
D.5.4.5 INSERT_CTX Instruction .......................................................................................................... 402
D.5.4.6 REPLACE Instruction ................................................................................................................ 403
D.5.4.7 RETRIEVE Instruction ............................................................................................................... 403
D.5.4.8 MUTE Instruction ....................................................................................................................... 404
D.5.5 IP Header Instructions ........................................................................................................................... 405
D.5.5.1 IPv4 Instruction ........................................................................................................................... 406
D.5.5.2 IPv4_CHECKSUM Instruction ................................................................................................. 407
D.5.5.3 IPv6 Instruction ........................................................................................................................... 407
D.5.6 Post-Process Instructions ...................................................................................................................... 408
D.5.6.1 INSERT_REMOVE_RESULT (IRR) Instruction ..................................................................... 409
D.5.6.2 REPLACE_BYTE Instruction .................................................................................................... 423
D.5.6.3 Reserved Instructions ................................................................................................................ 424
D.5.7 Result Instructions ................................................................................................................................. 424
D.5.7.1 VERIFY_FIELDS Instruction ..................................................................................................... 424
D.5.8 Context Control Instructions ................................................................................................................ 425
D.5.8.1 CONTEXT_ACCESS Instruction .............................................................................................. 425
D.5.9 Bypass Token Data – Special Instruction ............................................................................................ 429
D.6 Result Token Definition ............................................................................................................... 429
D.7 Pre and Post-Processing by Host Software ................................................................................ 432
D.7.1 Preprocessing .......................................................................................................................................... 432
D.7.2 Post-Processing ....................................................................................................................................... 433
D.7.2.1 Result Token ................................................................................................................................ 433
D.7.2.2 Appended Data ........................................................................................................................... 433
D.7.2.3 Suggested Post-Processing Operations ................................................................................... 433
D.8 Context Record Definition ............................................................................................................ 433
D.8.1 Context Record Format ......................................................................................................................... 433
D.8.2 Context Control Words ......................................................................................................................... 436
D.8.2.1 Control Word 0 Field Encoding ............................................................................................... 436
D.8.2.2 Context Control Word 1 Definition ......................................................................................... 439
D.8.2.3 Key ................................................................................................................................................ 443
D.8.2.4 Hash Digest ................................................................................................................................. 444
D.8.2.5 Security Parameter Index .......................................................................................................... 448
D.8.2.6 Sequence Number Processing ................................................................................................... 449
D.8.2.7 IV Data .......................................................................................................................................... 452
D.8.3 Examples for Most Common Scenarios .............................................................................................. 453
D.8.3.1 Basic Encryption Operation ...................................................................................................... 453
D.8.3.2 Basic Hash Operation ................................................................................................................. 454
D.8.3.3 Combined Basic Encrypt Hash Operations ............................................................................ 454
D.8.3.4 IPSec Hash Only Operation ...................................................................................................... 454
D.8.3.5 IPSec Encryption Only Operation ............................................................................................ 454
D.8.3.6 IPSec Combined Encryption Hash Operation ........................................................................ 454
D.8.3.7 Typical Initialization Values for IPSec Operations ................................................................ 455
D.9 Register and Memory Map ............................................................................................................456
D.9.1 Configuration Registers ......................................................................................................................... 463
D.9.1.1 Packet Engine Token Control ................................................................................................... 464
D.9.1.2 Packet Engine Context Control ................................................................................................. 466
D.9.1.3 Packet Engine Interrupts ........................................................................................................... 469
D.9.1.4 Packet Engine Data Fetch Control ............................................................................................ 470
D.9.1.5 Crypto Packet Processor Input and Output Transfer Control/Status Register ................ 471
D.9.1.6 Packet Engine Configuration .................................................................................................... 473
D.9.2 PRNG Registers ...................................................................................................................................... 474
D.9.2.1 PRNG Seed Register (PRNG_SEED_L, PRNG_SEED_H) .................................................... 476
D.9.2.2 PRNG DES Key Registers (PRNG_KEY0_L, PRNG_KEY0_H, PRNG_KEY1_L, PRNG_KEY1_H) ................ 476
D.9.2.3 PRNG Output Registers (PRNG_RES0_L, PRNG_RES0_H, PRNG_RES1_L, PRNG_RES1_H) ................ 478
D.9.2.4 PRNG LFSR Registers (PRNG_LFSR_L, PRNG_LFSR_H) ................................................... 480
D.10 Protocol Compliancy .....................................................................................................................481
D.10.1 Introduction .......................................................................................................................................... 481
D.10.2 Disclaimer .............................................................................................................................................. 481
D.10.3 IP Header ............................................................................................................................................... 481
D.10.4 ESP Processing ...................................................................................................................................... 482
D.10.5 AH Processing ...................................................................................................................................... 483
D.10.6 SSL Processing ...................................................................................................................................... 485
D.10.7 TLS Processing ...................................................................................................................................... 486
D.10.8 DTLS Processing ................................................................................................................................... 487
D.10.9 SRTP/SRTCP Processing .................................................................................................................... 488
D.10.10 MACsec Processing ............................................................................................................................ 488
APPENDIX E INLINE PACKET ENGINE — TOKEN EXAMPLES
E.1 Introduction .......................................................................................................................................491
E.1.1 Purpose ..................................................................................................................................................... 491
E.2 Token Examples — Basic Operations ..........................................................................................491
E.2.1 Bypass Packet Token (IPv4) ................................................................................................................... 492
E.2.2 Bypass Packet Token (IPv6) ................................................................................................................... 493
E.2.3 ESP Outbound Packet Token (IPv4, Transport Mode) ...................................................................... 494
E.2.3.1 With CBC Mode .......................................................................................................................... 494
E.2.3.2 With CTR Mode ........................................................................................................................... 495
E.2.3.3 With GCM .................................................................................................................................... 496
E.2.4 ESP Inbound Packet Token (IPv4, Transport Mode) ......................................................................... 498
E.2.4.1 With CBC Mode .......................................................................................................................... 498
E.2.4.2 With CTR Mode ........................................................................................................................... 500
E.2.4.3 With GCM .................................................................................................................................... 502
E.2.5 ESP Inbound Packet Token (IPv4, Transport Mode, Jumbo) ........................................................... 504
E.2.5.1 With CBC Mode .......................................................................................................................... 504
E.2.5.2 With GCM .................................................................................................................................... 506
E.2.6 ESP Outbound Packet Token (IPv4, Tunnel Mode) ........................................................................... 508
E.2.7 ESP Inbound Packet Token (IPv4, Tunnel Mode) .............................................................................. 509
E.2.8 ESP Outbound Packet Token (IPv6, Transport Mode) ...................................................................... 511
E.2.9 ESP Inbound Packet Token (IPv6, Transport Mode) ......................................................................... 513
E.2.10 AH Outbound Packet Token (IPv4, Transport Mode) .................................................................... 515
E.2.11 AH Inbound Packet Token (IPv4, Transport Mode) ....................................................................... 517
E.2.12 AH Outbound Packet Token (IPv4, Tunnel Mode) ......................................................................... 519
E.2.13 AH Outbound Packet Token (IPv4, Tunnel Mode, Jumbo) ........................................................... 521
E.2.14 AH Inbound Packet Token (IPv4, Tunnel Mode) ............................................................................ 524
E.2.15 AH Inbound Packet Token (IPv4, Tunnel Mode) Using Mute Instruction .................................. 526
E.2.16 AH Outbound Packet Token with Routing Extension Header (IPv6, Transport Mode) ........... 527
E.2.17 AH Outbound Packet Token with Multiple Extension Headers (IPv6, Transport Mode) ........ 529
E.2.18 AH Inbound Packet Token (IPv6, Transport Mode) ....................................................................... 532
E.2.19 AH Outbound Packet Token (IPv6, Tunnel Mode) ......................................................................... 535
E.2.20 AH Inbound Packet Token (IPv6, Tunnel Mode) ............................................................................ 537
E.2.21 sRTP Outbound — Packet Token (IPv4 — UDP — RTP) ............................................................... 539
E.2.22 sRTP Inbound — Packet Token (IPv4 — UDP — RTP) .................................................................. 541
E.2.23 Simple Token Examples ....................................................................................................................... 543
E.3 Token Examples - Advanced Operations .................................................................................... 544
E.3.1 Basic Processing ...................................................................................................................................... 544
E.3.1.1 Outbound ARC4 .......................................................................................................................... 544
E.3.2 ESP ............................................................................................................................................................ 545
E.3.2.1 ESP Outbound Packet Token (IPv4, Transport Mode, AES-CCM) ...................................... 545
E.3.2.2 ESP Inbound Packet Token (IPv4, Transport Mode, AES-CCM) ......................................... 547
E.3.2.3 ESP Outbound Packet Token (IPv4, Transport Mode, AES-GMAC) .................................. 549
E.3.2.4 ESP Inbound Packet Token (IPv4, Transport mode, AES-GMAC) ...................................... 551
E.3.2.5 ESP Outbound Packet Token (IPv4, Transport Mode with Encryption and SHA-2 Authentication) ............................................................................................................................................................... 553
E.3.2.6 ESP Inbound Packet Token (IPv4, Transport Mode with Encryption and SHA-2 Authentication) ................................................................................................................................................................... 555
E.3.3 AH ............................................................................................................................................................. 557
E.3.3.1 AH Outbound Packet Token (IPv4, Transport Mode, AES-GMAC) ................................... 557
E.3.3.2 AH Inbound Packet Token (IPv4, Transport Mode, AES-GMAC) ...................................... 559
E.3.4 SSL ............................................................................................................................................................. 561
E.3.4.1 Introduction ................................................................................................................................. 561
E.3.4.2 SSL Outbound Packet Token (ARC4) ....................................................................................... 561
E.3.4.3 SSL Inbound Packet Token (ARC4) .......................................................................................... 563
E.3.5 DTLS ......................................................................................................................................................... 565
E.3.5.1 Introduction ................................................................................................................................. 565
E.3.5.2 DTLS Outbound Packet Token (AES-CBC) ............................................................................. 565
E.3.5.3 DTLS Inbound Packet Token (AES-CBC) ................................................................................ 567
E.3.6 MACsec .................................................................................................................................................... 569
E.3.6.1 MACsec Outbound Packet Token (AES-GCM) ...................................................................... 569
E.3.6.2 MACsec Inbound Packet Token (AES-GCM) ......................................................................... 571
E.3.7 SRTCP ....................................................................................................................................................... 573
E.3.7.1 SRTCP Outbound Packet Token (AES-ICM) ........................................................................... 573
E.3.7.2 SRTCP Inbound Packet Token (AES-ICM) .............................................................................. 575
APPENDIX F REFERENCES
F.1 KASUMI References ........................................................................................................................577
F.2 Packet Processor References ...........................................................................................................577
F.3 SNOW-3G References .....................................................................................................................578
F.4 Public Key Accelerator References ...............................................................................................578
F.4.1 Open Specifications and Standards ...................................................................................................... 578
F.5 Inline Packet Engine (Token Example) References ...................................................................582
F.6 SGMII MAC Interface .....................................................................................................................584
GLOSSARY, CONVENTIONS AND STANDARDS ..................................................................... 585
G.1 Conventions and Standards ..........................................................................................................585
G.2 Glossary .............................................................................................................................................591
INDEX ................................................................................................................................ 597
PREFACE
About this Manual
This manual describes the interfaces supported in the TILE-Gx™ processor.
Intended Audience
This manual is intended for use by hardware engineers.
For information on how to access I/O devices via software, refer to the MDE System Programmer’s Guide (UG509).
Manual Contents Description
Chapters are grouped in sections called parts, based on the interface type: memory, high speed
interfaces, common interfaces, and so on. This manual is organized as follows:
• The Preface provides an overview of this manual, information about contacting customer support, and general instructions about how registers are described.
• Chapter 1: I/O Device Introduction. This chapter identifies the registers that are common to all supported interfaces.
• Chapter 2: Tile Processor. This chapter provides a detailed description of the Tile Processor’s system architecture. It describes memory, interrupts, communication within a processor via the software-visible dynamic networks, Special Purpose Registers (SPRs), and in-tile system devices such as counters and timers.
• Chapter 3: Double Data Rate SDRAM (DDR3) Interface. This chapter provides a detailed description of the memory controller, how the memory interface manages data flow, data ordering, performance features, how errors are handled, interrupts, and memory interface registers.
• Chapter 4: PCIe Controller Architecture (TRIO). This chapter describes how to integrate the Tile processors with a PCI system.
• Chapter 5: PCIe MAC Interface. This chapter describes booting, deadlock avoidance, power management, and PCIe registers.
• Chapter 6: mPIPE Architecture. This chapter describes line rate services for the packet interfaces.
• Chapter 7: XAUI MAC Interface. This chapter describes the XAUI MAC interface, flow control, interrupts, and registers.
• Chapter 8: SGMII MAC Interface. This chapter describes the SGMII MAC interface, flow control, interrupts, and registers.
• Chapter 9: TILE-Gx Interlaken Interface. This chapter describes the TILE-Gx Interlaken port, which provides a channelized packet interface between mPIPE and one or more SERDES lanes.
Tilera Confidential — Subject to Change Without Notice
• Chapter 10: USB Interface. This chapter describes the USB interface, flow control, interrupts, and registers.
• Chapter 11: Common Accelerator Interface (MiCA). This chapter describes the architecture of the TILE-Gx™ Multicore iMesh Coprocessing Accelerator (MiCA™). The MiCA provides a common front-end (both SW and HW) to various I/O off-load or acceleration functions, for example Crypto or Compression.
• Chapter 12: Cryptographic Accelerator Interface. This chapter describes the TILE-Gx Crypto implementation of the Multicore iMesh Coprocessing Accelerator (MiCA™).
• Chapter 13: Compression Accelerator Interface. This chapter describes those features unique to the compression functionality (MiCA).
• Chapter 14: Flexible I/O Interface. This chapter provides an overview of the controller architecture used to manage the flexible I/O pins.
• Chapter 15: Rshim Interfaces. This chapter describes the Rshim, which contains chip-level services for booting and debugging. It also hosts a number of the low-speed interfaces, including UARTs, I2C-Masters, I2C-Slave, and serial peripheral interface (SPI). These interfaces are described in Chapters 16 through 19.
• Chapter 16: UART Interfaces. This chapter describes the UART device interface. It describes boot options, flow control, error handling and associated interrupts, and registers.
• Chapter 17: I2C Master Interface. This chapter describes the I2C Master Interface, which provides an interface for tiles to write to and read from external I2C devices.
• Chapter 18: I2C Slave Interface. This chapter describes the I2C Slave Interface, which is the interface to an external I2C device.
• Chapter 19: SPI Interface. This chapter describes the SPI SROM interface, which provides an interface for tiles to write and read an off-chip SPI SROM.
• Appendix A: JTAG Interface. This appendix describes the JTAG interface, which provides an instruction register for reading and writing configuration registers within the Rshim.
• Appendix B: Classifier Instructions and SPRs. This appendix provides additional information about the classifier instructions and special purpose registers referenced in Chapter 6: mPIPE Architecture.
• Appendix C: Miscellaneous Accelerator Specifications. This appendix provides additional information about the four types of accelerators included with the TILE-Gx™ family of processors.
• Appendix D: Inline Packet Engine. This appendix provides a description of the crypto packet processor, information on how to configure it, and descriptions of the Pseudo Random Number Generator and input tokens.
• Appendix E: Inline Packet Engine — Token Examples. This appendix describes the format and use of sample input tokens.
• Appendix F: References. This appendix lists source materials and additional publications.
• Glossary, Conventions and Standards.
• Index.
Related Documents
Additional documentation from Tilera Corporation is available, including the following:
• Multicore Development Environment Release Notes, UG208
• TILE-Gx36 Tile Processor™ Preliminary Data Sheet, DS400
• TILE-Gx Instruction Set Architecture Specification, UG401
Technical or Customer Support
You can reach Tilera Corporation Customer Support in the following ways:
• Visit the Tilera Web site at http://www.tilera.com/
• E-mail questions to support@tilera.com
Product Information
You can obtain product information from the Tilera Corporation Web site, from the product CD-ROM, or from the printed publications (manuals).
Tilera Corporation is online at http://www.tilera.com/.
Notation Conventions
Text conventions used in this manual are as follows:
Table 1. Notation Conventions
Example: Close command (File menu)
Description: Titles in reference sections indicate the location of an item within the IDE’s menu system (for example, the Close command appears on the File menu).
Example: Write In Progress (WIP) bit
Description: Courier text indicates the names of:
• A bit or bitfield, for example the CHANNEL bit
• A special purpose register (SPR), for example the RSH_COORD SPR
• Code sample
• Application, for example the tile-monitor application
• Command, for example the link command
• State, for example the IDLE state
Example: RSHIM MMIO registers
Description: Blue and underlined text in Courier font indicates these are hypertext links to HTML files associated with the interface, or links to a specific web site.
Example: Chapter 7: XAUI MAC Interface
Description: Blue text in text font (Palatino) indicates a (cross-reference) Hypertext link to text elsewhere in the manual. If you are reading a soft-copy of this manual, you can click on the link to jump directly to the referenced section.
Example: Note: For correct operation, ...
Description: A Note provides supplementary information on a related topic. In the online version of this book, the word Note appears instead of this symbol.
Example: Caution: Incorrect device operation might result if ...
Example: Caution: Device damage might result if ...
Description: A Caution identifies conditions or inappropriate usage of the product that could lead to undesirable results or product damage. In the online version of this book, the word Caution appears instead of this symbol.
Conventions for Register Descriptions
Several notational conventions are used in this document. The following sections describe these conventions.
Conventions for Processor Families
Registers for each interface are described with the following:
• A narrative that includes addressing information.
• Register diagrams.
• Register bit description tables, like the one shown in Figure 1.
Figure 1. Sample Bit Description Table
Byte and Bit Order
The Tile Processor Architecture is little-endian. When sets of bytes are described or displayed in
this document, bytes with more significance are displayed to the left of bytes with lesser significance. More significant bytes are always numbered with a higher number than less significant
bytes (LSBs). When data is stored in memory, bytes that are of greater significance are stored in
higher numbered memory addresses than bytes of less significance.
When sets of bits are described or displayed in this document, bits of higher significance are displayed to the left of bits with lower significance. For instance, if 32 bits are to be displayed and are
numbered from 0 to 31, bit 31 is displayed to the left of bit 0. When groups of bits are operated on
as integers, the Tile Processor Architecture carries from the less significant bits to the more significant bits when adding. Bits numbered with a higher number have greater significance than bits
with a lower number.
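The byte ordering described above can be illustrated with a short Python sketch. It packs a 32-bit value in little-endian order and prints the byte found at each increasing address offset. The value 0x0A0B0C0D and the helper name store_little_endian are illustrative assumptions, not taken from the Tile Processor Architecture documentation.

```python
import struct


def store_little_endian(value: int) -> bytes:
    """Return the 4 bytes of a 32-bit value in increasing address order."""
    # "<I" selects little-endian layout for an unsigned 32-bit integer.
    return struct.pack("<I", value)


# Illustrative value (an assumption): 0x0A is the most significant byte.
word = 0x0A0B0C0D
memory = store_little_endian(word)

# Offset 0 holds the least significant byte; offset 3 holds the most significant.
for offset, byte in enumerate(memory):
    print(f"address base+{offset}: 0x{byte:02X}")
```

The more significant bytes land at the higher offsets, which matches the rule that bytes of greater significance are stored in higher numbered memory addresses.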
Reserved Fields
Bit fields in registers for computational architectures are often unused. In the Tile Processor
Architecture, unused bits are considered reserved zero (reserved 0). When bits labeled as
reserved 0 are read, they are not guaranteed to return zero. Bits denoted as reserved 0 must be
written as zero. Bits that are ignored by the hardware are explicitly called out as being write-ignored.
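One practical consequence of the reserved 0 rule is that software performing a read-modify-write cannot simply write back what it read. The following Python sketch shows one way to handle this. The mask value (bits [31:21] reserved in a 32-bit register) and the function name prepare_write are hypothetical and chosen only for illustration.

```python
# Hypothetical register layout (an assumption): bits [31:21] are reserved 0.
RESERVED0_MASK = 0xFFE00000


def prepare_write(read_value: int, field_mask: int, field_value: int) -> int:
    """Build the value to write back after updating one field of a register."""
    # Replace only the bits covered by field_mask.
    updated = (read_value & ~field_mask) | (field_value & field_mask)
    # Reads of reserved 0 bits may return non-zero, so force them to zero
    # before writing, and keep the result within 32 bits.
    return updated & ~RESERVED0_MASK & 0xFFFFFFFF


# A read that returned a stray 1 in reserved bit 31; update bits [4:0] to 0x0A.
print(hex(prepare_write(0x80000013, 0x0000001F, 0x0A)))
```

Clearing the reserved bits in one place, just before the write, keeps the rule independent of which field is being updated.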
Numbering
The default base used in this document is base ten, or decimal representation. Any use of a bare
numeric is considered to be a decimal number. Hexadecimal numbering is also used widely in
this document. When a numeric is to be interpreted as a hexadecimal (base sixteen) number, the
prefix “0x” is prepended to the number. For example, the number 74 can also be expressed as
0x4A when written in hexadecimal.
When ranges of bits are numbered as a subset of a larger set of ordered bits, a bracket notation is
used. The notation contains one or two numbers separated by a colon. If only one number is specified, the numbered bit position is the bit referenced. For example, if “bus” is a 32-bit bus that is
numbered 31 to 0 and the text describes bit 5, bus[5] is appropriate nomenclature to signify that
bit. For bit ranges, the left number is the higher-ordered bit location and the right number is the
lower-ordered bit location. Bit ranges are inclusive. This nomenclature is consistent with the
default manner in which little-endian bit ranges are denoted. For example, if word is a 32-bit
word numbered 31 to 0 and the text describes the bits from bit 5 to bit 20, the appropriate manner
to denote that is word[20:5].
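The bracket notation maps directly onto a shift and a mask. The following Python sketch is one possible reading of the notation; the helper name bits and the sample values are assumptions made for illustration.

```python
from typing import Optional


def bits(value: int, high: int, low: Optional[int] = None) -> int:
    """Return value[high:low] (inclusive), or value[high] if low is omitted."""
    if low is None:
        low = high
    if high < low:
        raise ValueError("the left number must be the higher-ordered bit")
    width = high - low + 1          # bit ranges are inclusive
    return (value >> low) & ((1 << width) - 1)


word = 0x001FFFE0                   # illustrative: only bits 20 down to 5 are set
print(hex(bits(word, 20, 5)))       # word[20:5]
print(bits(0x20, 5))                # bus[5] for a bus whose value is 0x20
print(74 == 0x4A)                   # decimal 74 and hexadecimal 0x4A are the same number
```

Because the range is inclusive, word[20:5] is 16 bits wide (20 - 5 + 1).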
Figure 2 shows an example of how bitfields are graphically presented in this document.
Bits[31:21] are shown as reserved bits.
[Figure: 32-bit word with field boundaries at bits 31, 21/20, 5/4, and 0, labeled First Field, Second Field, and Third Field]
Figure 2. Bitfield Example
Figure 3 shows four bitfields that are logically represented along with a gap. The gap is not
reserved, but is instead allocated by another function.
[Figure: bit positions 58 to 51 and 26 to 20; fields Opcode_Y2 (0x2), SrcA_Y2[5:1] (Src[5:1]), SrcA_Y2[0:0] (Src[0:0]), and SrcBDest_Y2 (Dest)]
Figure 3. Bitfield Example with Fields Allocated by Other Functions
CHAPTER 1
I/O DEVICE INTRODUCTION
1.1 Overview
The TILE-Gx™ family of processors contains numerous on-chip I/O devices to facilitate direct
connections to DDR3 memory, Ethernet, PCI Express, USB, I2C, and other standard interfaces.
This chapter provides a brief overview of the on-chip I/O devices. For additional information
about the I/O devices, refer to individual chapters in the book. For detailed system programming
information refer to the Special Purpose Registers (SPRs) and the associated device API guides.
1.1.1 Tile-to-Device Communication
Tile processors communicate with I/O devices via loads and stores to MMIO (Memory Mapped
IO) space. The page table entries installed in a Tile’s Translation Lookaside Buffer (TLB) contain a
MEMORY_ATTRIBUTE field, which is set to MMIO for pages that are used for I/O device communication.
The X,Y fields in the page table entry indicate the location of the I/O device on the mesh and the
translated physical address is used by the I/O device to determine the service or register being
accessed.
Since each I/O TLB entry contains the X,Y coordinate of the I/O device being accessed, each
device effectively has its own 40-bit physical address space for MMIO communication that is not
shared with other devices or Tile physical memory space.
This physical memory space is divided into the fields shown in Figure 1-1 and defined in
Table 1-1.
Note: Not all I/O devices use this partitioning. For example, MiCA does not have Regions or
Service Domains. It uses a different type of division, which is described in “Common
Accelerator Interface (MiCA)” on page 175.
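[Figure: address fields shown left to right as A (Channel), B (Service Domain), Reserved (C), D (Region), and E (Offset), with bit labels 31 and 0. Legend: Offset: 0...E; Region: 0...D; Reserved: 0...C; Service Domain: 0...B; Channel: 0...A]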
Figure 1-1. TILE-Gx Device Address Space
Table 1-1. TILE-Gx Physical Memory Space Descriptions
Bits | Bit Name | Required | Size | Description
A | Channel | No | Variable | Used when more than one device shares the same mesh location.
B | Service Domain | No | Variable | Used to index “permissions” table and allow/deny access to specific device services.
C | Reserved | No | Variable | Any “middle” bits of address that are not used.
D | Region | No | Variable | Selects service being accessed (for example register space vs. DMA descriptor post).
E | Offset | Yes | Variable | Address within the “Region” being accessed.
Each device has registers in Region-0 used to control and monitor the device. Devices can also
implement additional MMIO address spaces for device communication protocols, such as posting
DMA descriptors or returning buffers. System software is responsible for creating and maintaining the page table mappings that provide access to device services.
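The field partitioning described above can be illustrated with a small packing routine. This is only a sketch under stated assumptions: the field widths are hypothetical (the guide lists every width as variable and device-specific), the type and function names are invented for illustration, and the fields are assumed to be packed with Offset in the least-significant bits and Channel in the most-significant bits.

```c
#include <stdint.h>

/* Hypothetical field widths, in bits. Real widths are device-specific. */
typedef struct {
    unsigned channel_bits;
    unsigned svc_dom_bits;
    unsigned reserved_bits;
    unsigned region_bits;
    unsigned offset_bits;
} mmio_layout_t;

/* Pack the fields into a device-local MMIO address.
 * Assumed order, low to high: Offset, Region, Reserved, Service Domain, Channel. */
static uint64_t mmio_pack(const mmio_layout_t *l, uint64_t channel,
                          uint64_t svc_dom, uint64_t region, uint64_t offset)
{
    unsigned shift = 0;
    uint64_t addr = 0;

    addr |= (offset & ((1ULL << l->offset_bits) - 1)) << shift;
    shift += l->offset_bits;
    addr |= (region & ((1ULL << l->region_bits) - 1)) << shift;
    shift += l->region_bits + l->reserved_bits;   /* reserved bits stay 0 */
    addr |= (svc_dom & ((1ULL << l->svc_dom_bits) - 1)) << shift;
    shift += l->svc_dom_bits;
    addr |= (channel & ((1ULL << l->channel_bits) - 1)) << shift;
    return addr;
}
```

With a hypothetical 40-bit layout of 2 + 4 + 10 + 8 + 16 bits, a call such as mmio_pack(&layout, 1, 3, 2, 0x20) places each value in its own bit range, and any value wider than its field is truncated by the mask.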
1.1.2 Coherent Shared Memory
I/O devices that provide bulk data transport utilize the high-performance, shared memory system implemented on TILE-Gx processors. All Tile memory system reads and writes initiated from
an I/O device are delivered to a home Tile as specified in the physical memory attributes for the
associated cacheline.
I/O TLBs and/or memory management units (MMUs) are used to translate user or external I/O
domain addresses into Tile physical addresses. This provides protection, isolation, and virtualization via a standard virtual memory model.
1.1.3 Device Protection
In addition to the protection provided by the TLB for MMIO loads and stores, devices can provide
additional protection mechanisms via the service domain field of the physical address. This
allows, for example, portions of a large I/O physical address space to be fragmented, such that
services can be allowed/denied to particular user processes without requiring dedicated (smaller)
TLB mappings for each allowed service.
1.1.4 Interrupts
Device interrupts are delivered to Tile software via the Tile Interprocess Interrupt (IPI) mechanism. Each Tile has four IPI MPLs, each with 32 interrupt events. I/O interrupts have
programmable bindings in their MMIO register space, which specify the target Tile, interrupt
number (also referred to as the IPI Minimum Protection Level or IPI MPL), and event number.
System software can choose to share Tile interrupt event bits among multiple I/O devices or dedicate the interrupt bits to a single I/O interrupt. Interrupt bits can also be shared between I/O and
Tile-to-Tile interrupts.
I/O devices implement interrupt status and enable bits to allow interrupt sharing and coalescing.
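The role of the status and enable bits can be sketched as a simple mask operation. The code below is illustrative only: it assumes a hypothetical device with 32 status bits and 32 matching enable bits, and the function names are not part of any Tilera API. Actual register names and layouts are device-specific.

```c
#include <stdint.h>

/* Events that should raise the shared interrupt: asserted AND enabled. */
static uint32_t pending_events(uint32_t status, uint32_t enable)
{
    return status & enable;
}

/* Lowest-numbered pending event, or -1 if nothing is pending.
 * A handler sharing one interrupt bit could loop on this to service
 * several coalesced events in a single invocation. */
static int first_pending_event(uint32_t status, uint32_t enable)
{
    uint32_t pending = pending_events(status, enable);
    for (int i = 0; i < 32; i++) {
        if (pending & (1u << i))
            return i;
    }
    return -1;
}
```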
1.1.5 Device Discovery
To facilitate a common device initialization framework, the TILE-Gx processors contain registers
and I/O structures that allow non-device-specific software to “discover” the connected I/O
devices for a given chip. After discovery, device-specific software drivers can be launched as
needed.
All TILE-Gx processors contain an Rshim. The Rshim contains chip-wide services including boot
controls, diagnostics, clock controls, reset controls, and global device information.
The Rshim’s RSH_FABRIC_DIM, RSH_FABRIC_CONN, and RSH_IPI_LOC registers provide Tile-fabric sizing, I/O connectivity, and IPI information to allow software to enumerate the various
devices. The common registers located on each device contain the device identifier used to launch
device-specific driver software.
In order for Level-1 boot software to perform discovery, it must first find the Rshim. This is done
by reading the RSH_COORD SPR located in each Tile.
Thus the basic device discovery flow is:
1. Read the RSHIM_COORD SPR to determine the Rshim location on the mesh.
2. Install an MMIO TLB entry for the Rshim.
3. Read the RSH_FABRIC_CONN vectors from Rshim to determine I/O device locations.
4. Install MMIO TLB entries for each I/O device.
5. Read the RSH_DEV_INFO register from each device to determine what the device type is, and launch any device-specific software.
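The discovery flow above can be sketched against a mocked chip. Everything in this example is an assumption made for illustration: the mock structures stand in for the SPR and MMIO reads, the encoding of the connectivity information is not modeled, and a DEV_INFO value of 0 is taken (hypothetically) to mean that no device is present.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_DEVICES 8

/* Mocked hardware state standing in for real SPR and MMIO reads. */
typedef struct {
    int x, y;            /* mesh coordinates of the device */
    uint64_t dev_info;   /* value its device-info register would return */
} mock_device_t;

typedef struct {
    int rshim_x, rshim_y;                /* what the Rshim coordinate SPR would report */
    mock_device_t devices[MAX_DEVICES];  /* what the connectivity vectors would describe */
    size_t num_devices;
} mock_chip_t;

/* Walk the flow and collect the device-info value of every device found.
 * Returns the number of entries stored in found[]. */
static size_t discover_devices(const mock_chip_t *chip, uint64_t *found,
                               size_t max_found)
{
    size_t n = 0;

    /* Steps 1-2: on hardware, read the Rshim coordinates and install an
     * MMIO TLB entry for the Rshim. The mock only looks them up. */
    int rshim_x = chip->rshim_x, rshim_y = chip->rshim_y;
    (void)rshim_x;
    (void)rshim_y;

    /* Step 3: iterate over the device locations reported by the Rshim. */
    for (size_t i = 0; i < chip->num_devices && n < max_found; i++) {
        /* Step 4 would install an MMIO TLB entry for this device here. */
        /* Step 5: read the device-info value; a driver would be launched
         * based on it. */
        uint64_t info = chip->devices[i].dev_info;
        if (info != 0)
            found[n++] = info;
    }
    return n;
}
```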
1.1.6 Common Registers
While each device has unique performance and API requirements, a common device architecture
allows a modular software driver model and device initialization process. The first 256 bytes of
MMIO space contain the “common” registers that all I/O devices implement (see footnote 1). The common registers are used for device discovery as well as basic physical memory initialization and MMIO
page sizing.
Table 1-2. Common Registers
Register                   Address            Description
DEV_INFO                   0x0000             This provides general information about the device attached to this port and channel.
DEV_CTL                    0x0008             This provides general device control.
MMIO_INFO                  0x0010             This provides information about how the physical address is interpreted by the I/O device.
MEM_INFO                   0x0018             This provides information about memory setup required for this device.
SCRATCHPAD                 0x0020             This is for general software use, and is not used by the I/O shim hardware for any purpose.
SEMAPHORE0 and SEMAPHORE1  0x0028 and 0x0030  This is for general software use, and is not used by the I/O shim hardware for any purpose.
CLOCK_COUNT                0x0038             This is for general software use, and is not used by the I/O shim hardware for any purpose.
HFH_INIT_CTL               0x0050             Initialization control for the hash-for-home tables.
HFH_INIT_DAT               0x0058             Read/Write data for hash-for-home tables.
1. The “common registers” are located from 0x0000-0x0058.
Each of the major register sets (for example, the GPIO, UART, and MiCA Crypto registers) for a
specific device includes the common registers. The SCRATCHPAD register, for
example, is a common register included in each of the register sets. The register set name is prepended to the register name as follows:
• GPIO Register: GPIO_SCRATCHPAD register
• UART Register: UART_SCRATCHPAD register
• MICA_CRYPTO Register: MICA_CRYPTO_SCRATCHPAD register
Addresses from 0x100 upward contain the device-specific registers.
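The common register offsets listed in the table above can be captured as constants. The sketch below is not taken from the MDE headers: the COMMON_ prefix, the helper names, and the assumption that each register is 64 bits wide (suggested by the 8-byte address spacing) are all illustrative, and mmio_base stands for a hypothetical pointer to a device's mapped MMIO window.

```c
#include <stdint.h>

/* Byte offsets of the common registers, as listed in the table above. */
enum common_reg {
    COMMON_DEV_INFO     = 0x0000,
    COMMON_DEV_CTL      = 0x0008,
    COMMON_MMIO_INFO    = 0x0010,
    COMMON_MEM_INFO     = 0x0018,
    COMMON_SCRATCHPAD   = 0x0020,
    COMMON_SEMAPHORE0   = 0x0028,
    COMMON_SEMAPHORE1   = 0x0030,
    COMMON_CLOCK_COUNT  = 0x0038,
    COMMON_HFH_INIT_CTL = 0x0050,
    COMMON_HFH_INIT_DAT = 0x0058
};

/* The first 256 bytes hold common registers; device-specific ones follow. */
#define DEVICE_SPECIFIC_BASE 0x100u

static int is_common_reg_offset(uint32_t offset)
{
    return offset < DEVICE_SPECIFIC_BASE;
}

/* Read an (assumed) 64-bit register at a byte offset within a mapped window. */
static uint64_t mmio_read64(const volatile void *mmio_base, uint32_t offset)
{
    const volatile uint8_t *p = (const volatile uint8_t *)mmio_base + offset;
    return *(const volatile uint64_t *)p;
}
```

Because every device places the same registers at the same offsets, one helper such as mmio_read64(base, COMMON_DEV_INFO) can serve non-device-specific discovery code regardless of which register set prefix (GPIO_, UART_, and so on) the device documentation uses.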
Register definitions can be found as part of the MDE build and are located in the HTML directory.
The directory structure is as follows:
• Memory Controller
• GPIO
• Rshim
• I2C Slave
• I2C Master
• SROM
• UART
Compression
• MiCA Compression Global
• MiCA Compression Inflate Engine
• MiCA Compression Deflate Engine
• MiCA Compression User Context
• MiCA Compression System Context
Crypto
• MiCA Crypto Global
• MiCA Crypto Engine
• MiCA Crypto User Context
• MiCA Crypto System Context
mPIPE / MACs
• mPIPE
• XAUI (Interface/MAC)
• GbE (Interface/MAC)
• Interlaken (Interface/MAC)
• mPIPE SERDES Control
TRIO / PCIe
• TRIO
• PCIe Interface (SERDES control, endpoint vs. root etc.)
• PCIe Endpoint
• PCIe Root Complex
• PCIe SERDES Control
USB
• USB Host
• USB Endpoint
• USB Host MAC
• USB Endpoint MAC
CHAPTER 2
TILE PROCESSOR
2.1 System Architecture Overview
The Tile Processor™ is a new class of multicore processing engine that delivers unprecedented
levels of performance, flexibility, and power efficiency. The Tile processor is fully programmable
using standard ANSI C and C++, which makes it easy to port existing applications to the Tile Processor environment. The device implements Tilera’s iMesh™ Multicore technology, which
enables applications to be scaled across multiple cores or tiles. Combining multiple C and C++
programmable processor tiles with iMesh multicore technology enables the Tile Processor to
achieve the performance of an ASIC in a software-programmable solution, which reduces development costs and shortens time-to-market.
Each tile is a powerful full-featured processor that can independently run an entire operating system. Each tile implements a 64-bit integer processor engine, a memory management unit (MMU)
including TLBs (Translation Lookaside Buffers), a register file, a program counter (PC), and an
L1/L2 cache subsystem.
The tiles in the Tile Processor are connected to each other, to the external memory, and to the I/O
by the Tilera iMesh multicore technology. Attaching the memory controllers and I/O to the iMesh
allows any tile to access any memory and also allows any tile to service any I/O device. The
iMesh also supplies very low-latency messaging and scalar transfers to user-level applications,
enabling very efficient multi-programming. Tilera’s iMesh multicore technology enables the Tile
Processor to provide performance scalability and high bandwidth/low latency communication
between tiles.
The Tile Processor implements a powerful protection mechanism of processing resources to allow
fine-grained control and management by operating systems and/or virtual machine implementation. The Tile processor implements four protection levels, which can be used simultaneously to
supply user level, operating system level, virtual machine level, and debug level programming.
The protection system divides processing resources into functionally related groups, and protects
each of these functions with an individual access control. This allows the distribution of the control over these processing resources to different levels of the support software stack.
The Tile Processor supplies memory protection by implementing a virtual memory system with
support for multiple operating systems running on multiple tiles. The virtual memory system
implements 64-bit virtual addresses, and up to 64 bits of physical address space. The Tile processor can support a physical address mode for applications that do not require virtual memory. The
memory system supports a number of coherent shared memory options with different performance characteristics to optimize performance for different kinds of workloads.
The processor engine, the primary computational resource, is an asymmetric very long instruction
word (VLIW) processor. Each instruction bundle is 64 bits wide and can encode either two or
three instructions. Some instructions can be encoded in either two-wide or three-wide bundles,
and some can be encoded in two-wide bundles only. The most common instructions and those
with short immediates can be encoded in the three-instruction format. Nearly all instructions have a
single-cycle result latency, with the exception of complex SIMD instructions, multiplication, and most
floating-point and memory instructions. Nearly all of the multi-cycle instructions are fully pipelined, allowing additional instructions to be issued in the following cycles. The complex SIMD,
multiplication, and floating-point instructions have a two-cycle result latency. A memory load instruction that cannot obtain its data immediately from the L1 data cache does not stall
the pipeline until another operation attempts to read the register that the load writes.
In this way, additional instructions can be issued during cache misses. Additional
information about the processor engine and its instruction set is provided in the Instruction Set
Architecture for TILE-Gx (UG401).
2.2 Memory Architecture
The Tile Processor architecture defines a flat, globally shared 64-bit physical address space and a
64-bit virtual address space (note that TILE-Gx processors implement a 40-bit subset physical
address and 42-bit subset virtual address). Memory is byte-addressable and can be addressed in 1,
2, 4 or 8 byte units, depending on alignment. Memory transactions to and from a tile occur via the
iMesh.
The globally shared physical address space provides the mechanism by which software running
on different tiles, and I/O devices, share instructions and data. Memory is stored in off-chip
DDR3 DRAM.
Page tables are used to translate virtual addresses to physical addresses (page size range is 4 kB to
64 GB). The translation process includes a verification of protected regions of memory, and also a
designation of each page of physical addresses as either coherent, non-coherent, uncacheable, or
memory mapped I/O (MMIO). For coherent and non-coherent pages, values from recently accessed memory addresses are stored in caches located in each tile. Uncacheable and MMIO
addresses are never put into a tile cache.
The Address Space Identifier (ASID) is used for managing multiple active address spaces.
Recently-used page table entries are cached in TLBs (Translation Lookaside Buffers) in both tiles
and I/O devices.
Hardware provides a cache-coherent view of memory to applications. That is, a read by a tile or
I/O device to a given physical address will return the value of the most recent write to that address,
even if it is in a tile’s cache. Instruction memory that is written by software (self-modifying code)
is not kept coherent by hardware. Rather, special software sequences using the icoh instruction
must be used to enforce coherence between data and instruction memory.
Atomic operations include FetchAdd, CmpXchg, FetchAddGez, Xchg, FetchOr, and
FetchAnd. Memory ordering is relaxed, and a memory fence instruction provides sequential
ordering. See “Memory Consistency Model” on page 47.
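As a rough analogy only, the listed operations resemble C11 atomic read-modify-write operations, and the fence resembles a sequentially consistent thread fence. The sketch below uses standard C11 atomics rather than Tile instructions or intrinsics. It assumes that FetchAddGez stores the sum only when the result is greater than or equal to zero and returns the old value in either case; the function name is hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Conditional fetch-and-add built from a compare-exchange loop.
 * Assumed semantics: store old + delta only if that sum is >= 0;
 * always return the value observed before the operation. */
static int64_t fetch_add_gez(_Atomic int64_t *p, int64_t delta)
{
    int64_t old = atomic_load_explicit(p, memory_order_relaxed);
    while (old + delta >= 0 &&
           !atomic_compare_exchange_weak_explicit(p, &old, old + delta,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        /* On failure, old has been reloaded with the current value. */
    }
    return old;
}

/* With relaxed ordering, a full fence is needed wherever sequential
 * ordering between earlier and later accesses is required. */
static void full_fence(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}
```

A conditional add of this kind is a natural fit for counting semaphores, where a decrement must not take the count below zero.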
Virtual Address Space
The virtual address is architecturally 64 bits, but is implemented as 42 bits in the TILE-Gx processor. A legal virtual address is a sign-extended 42-bit value, that is, bits [63:41] of the VA are all 0s or all
1s; any other virtual address is illegal. As a result, there are two legal VA regions, lower and upper,
with an illegal region in the middle, as shown in Table 2-1.
It is illegal to do a memory operation (for example load or store), or to execute instructions
from an illegal VA, or to take a branch from the lower to upper VA region (or vice-versa). An
attempt to do so will result in an exception.
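The legality rule can be expressed as a check on bits [63:41]. This is a minimal sketch; the function names are illustrative and not part of any Tilera API.

```c
#include <stdbool.h>
#include <stdint.h>

#define VA_IMPLEMENTED_BITS 42

/* A VA is legal when bits [63:41] are all 0s (lower region)
 * or all 1s (upper region). */
static bool va_is_legal(uint64_t va)
{
    unsigned shift = VA_IMPLEMENTED_BITS - 1;        /* 41 */
    uint64_t upper = va >> shift;                    /* bits [63:41], 23 bits */
    uint64_t all_ones = (1ULL << (64 - shift)) - 1;  /* 0x7FFFFF */
    return upper == 0 || upper == all_ones;
}

/* True when both addresses are legal and lie in the same region, which is
 * the condition a branch source and target must satisfy. */
static bool va_same_region(uint64_t a, uint64_t b)
{
    return va_is_legal(a) && va_is_legal(b) && ((a >> 63) == (b >> 63));
}
```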
Table 2-1. Virtual Address Space
Address                       Region
2^64 - 1 ... 2^64 - 2^41      Upper VA Region
2^64 - 2^41 - 1 ... 2^41      Illegal VA Region
2^41 - 1 ... 0                Lower VA Region
2.3 Memory Addressing
2.3.1 TLB Management
Each TLB entry can be directly read and written by system software. The architecture does not
prescribe the number of TLB entries that an implementation contains. Rather, it sets a maximum
number of TLB entries and allows the number of implemented TLB entries to be an implementation parameter. The NUMBER_DTLB and NUMBER_ITLB registers are read-only
special purpose registers (SPRs) that denote how many TLB entries each of the respective TLBs contains.
TILE-Gx implements 16 ITLB entries and 32 DTLB entries. The WIRED_DTLB and WIRED_ITLB
registers specify the number of TLB entries that are managed completely by software and will not
be selected by hardware for replacement.
The REPLACEMENT_ITLB and REPLACEMENT_DTLB SPRs are maintained by hardware to generate a recommended replacement TLB entry. The specific algorithm that hardware uses to generate
the replacement entry numbers is implementation-specific; TILE-Gx uses a random replacement
algorithm. On processor reset, these SPRs are reset to the number of implemented TLB entries minus one. The recommended TLB entry will not be a wired entry.
A given TLB entry can be read or written by first indexing the desired element and then using
the proper TLB_CURRENT_x SPR to read or write the entry selected by the index register. The
DTLB_INDEX and ITLB_INDEX registers provide the index. There
are three SPRs (TLB_CURRENT_VA, TLB_CURRENT_PA, and TLB_CURRENT_ATTR), which access
the three words in the TLB entry that is indexed by DTLB_INDEX or ITLB_INDEX. The TLB current
registers do not contain state, but rather are indexes into the TLB state. A set of TLB_CURRENT_x registers
is supported for each type of TLB, with the prefixes DTLB_CURRENT and ITLB_CURRENT.
To read a TLB entry, software writes the index into the DTLB_INDEX or ITLB_INDEX SPR with the top bit (the MSB) set.
Setting the MSB causes the specified entry to be read and stored in the TLB_CURRENT_x
SPRs. Software should issue a DRAIN instruction between setting the index register and reading
data from the TLB_CURRENT_x registers.
To write a TLB entry, software writes the index into the
DTLB_INDEX or ITLB_INDEX SPR and then writes the TLB_CURRENT_x SPRs, finishing with TLB_CURRENT_ATTR.
The write of TLB_CURRENT_ATTR is the trigger for writing the entire TLB entry specified by the
TLB_CURRENT_x registers into the actual TLB.
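The index-then-access protocol can be modeled in plain C. This is a software model for illustration, not driver code: the structure and function names are invented, the CURRENT registers are modeled as a staging copy (a simplification of the hardware behavior), and the DRAIN instruction required on real hardware between setting the index and reading the CURRENT registers has no counterpart in the model.

```c
#include <stdint.h>

#define DTLB_ENTRIES   32                /* TILE-Gx DTLB size given in the text */
#define INDEX_READ_BIT (1ULL << 63)      /* top bit of the index register */
#define INDEX_MASK     0x1FULL           /* 5 index bits select 32 entries */

typedef struct { uint64_t va, pa, attr; } tlb_entry_t;

typedef struct {
    tlb_entry_t entries[DTLB_ENTRIES];
    uint64_t index;        /* models DTLB_INDEX */
    tlb_entry_t current;   /* models DTLB_CURRENT_VA / _PA / _ATTR */
} dtlb_model_t;

/* Writing the index selects an entry; with the top bit set, the entry is
 * also loaded into the CURRENT registers. */
static void write_index(dtlb_model_t *t, uint64_t value)
{
    t->index = value & INDEX_MASK;
    if (value & INDEX_READ_BIT)
        t->current = t->entries[t->index];
}

static void write_current_va(dtlb_model_t *t, uint64_t va) { t->current.va = va; }
static void write_current_pa(dtlb_model_t *t, uint64_t pa) { t->current.pa = pa; }

/* Writing ATTR is the trigger that commits the whole entry to the TLB,
 * so it must be the last of the three writes. */
static void write_current_attr(dtlb_model_t *t, uint64_t attr)
{
    t->current.attr = attr;
    t->entries[t->index] = t->current;
}
```

In this model, writing VA and PA alone leaves the selected entry untouched; the entry changes only when ATTR is written, which is why the ATTR write comes last.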
Data TLB Number of Entries Register (NUMBER_DTLB)
This register specifies how many data TLB entries there are.
Speed: Fast
Minimum Protection Level: DTLB_MISS
Figure 2-1: NUMBER_DTLB Register Diagram
Table 2-2. NUMBER_DTLB Register Bit Descriptions
Bits   Name      Reset  Description
63:13  Reserved         Reserved
12:0   NUM       0      Number. TILE-Gx implements the bitfield 5:0; writes to bits 12:6 are ignored, and these bits are read as 0.
Instruction TLB Number of Entries Register (NUMBER_ITLB)
This register specifies how many instruction TLB entries there are.
Speed: Fast
Minimum Protection Level: ITLB_MISS
Figure 2-2: NUMBER_ITLB Register Diagram
Table 2-3. NUMBER_ITLB Register Bit Descriptions
Bits   Name      Reset  Description
63:13  Reserved         Reserved
12:0   NUM       0      Number. TILE-Gx implements the bitfield 4:0; writes to bits 12:5 are ignored, and these bits are read as 0.
Instruction TLB Replacement Index Register (REPLACEMENT_ITLB)
This register specifies which instruction TLB entry should be replaced by the random replacement
algorithm.
Speed: Fast
Minimum Protection Level: ITLB_MISS
Figure 2-3: REPLACEMENT_ITLB Register Diagram
Table 2-4. REPLACEMENT_ITLB Register Bit Descriptions
Bits   Name      Reset  Description
63:12  Reserved         Reserved
11:0   INDEX     0      Index. TILE-Gx implements the bitfield 3:0; writes to bits 11:4 are ignored, and these bits are read as 0.
Data TLB Replacement Index Register (REPLACEMENT_DTLB)
This register specifies which data TLB entry should be replaced by the random replacement algorithm.
Speed: Fast
Minimum Protection Level: DTLB_MISS
Figure 2-4: REPLACEMENT_DTLB Register Diagram
Table 2-5. REPLACEMENT_DTLB Register Bit Descriptions
Bits   Name      Reset  Description
63:12  Reserved         Reserved
11:0   INDEX     0      Index. Bitfield 4:0 is implemented; writes to bits 11:5 are ignored, and these bits are read as 0.
Instruction TLB Entry VA Register (ITLB_CURRENT_VA)
This register is used to read and write the virtual address of the main processor instruction TLB
Entry.
Speed: Fast
Minimum Protection Level: ITLB_MISS
Figure 2-5: ITLB_CURRENT_VA Register Diagram
Table 2-6. ITLB_CURRENT_VA Register Bit Descriptions
Bits   Name      Reset  Description
63:12  VPN       0      Virtual Page Number.
11:0   Reserved         Reserved
Instruction TLB Entry PA Register (ITLB_CURRENT_PA)
This register is used to read and write the physical address of the main processor instruction TLB
Entry.
Speed: Fast
Minimum Protection Level: ITLB_MISS
Figure 2-6: ITLB_CURRENT_PA Register Diagram
Table 2-7. ITLB_CURRENT_PA Register Bit Descriptions
Bits   Name      Reset  Description
63:40  Reserved         Reserved
39:12  PFN       0      Physical Frame Number.
11:0   Reserved         Reserved
Instruction TLB Entry Attribute Register (ITLB_CURRENT_ATTR)
This register is used to read and write the processor instruction TLB Entry attribute. Writing this
register triggers the write of the ITLB.
Speed: Fast
Minimum Protection Level: ITLB_MISS
Figure 2-7: ITLB_CURRENT_ATTR Register Diagram
Table 2-8. ITLB_CURRENT_ATTR Register Bit Descriptions
Bits   Name                       Reset  Description
63:48  Reserved                          Reserved
47:37  LOCATION_X_OR_PAGE_MASK    0      Location Override Target X field for MMIO page and the non-hash-for-home Coherent or NonCoherent pages; Page mask field for hash-for-home Coherent or NonCoherent pages.
36:26  LOCATION_Y_OR_PAGE_OFFSET  0      Location Override Target Y field for MMIO page and the non-hash-for-home Coherent or NonCoherent pages; Page offset field for hash-for-home Coherent or NonCoherent pages.
25     CACHE_PREFETCH             0      Cache Page Prefetch Attribute. Hardware prefetcher may generate prefetches. The TILE-Gx does not implement the page-prefetch hardware, and the attribute is reserved for a future implementation.
24     Reserved                          Reserved
23     PIN                        0      PIN. L2 and L3 cache allocation follows the TILE-Gx cache pinning rule. The attribute is used in Coherent and NonCoherent pages, and is ignored in the Uncacheable or MMIO pages.
22     ADAPTIVE_ALLOCATION        0      L2 cache allocation follows the TILE-Gx Adaptive Allocation rule. The attribute is used in Coherent and NonCoherent pages, and is ignored in the Uncacheable or MMIO pages.
21     No_L1D_ALLOCATION          0      L1D cache will not be filled (ignored by the instruction stream requests). The attribute is used in Coherent and NonCoherent pages, and is ignored in the Uncacheable or MMIO pages.
20:19  CACHE_HOME_MAPPING         0      Describes how the home cache for each cacheline is determined.
                                         Value  Name  Meaning
                                         0      HASH  The home cache is computed from the cacheline's physical address using the default hash-for-home scheme.
                                         3      TILE  For all lines, the home is the tile whose X and Y coordinates are specified in the LOTAR_X_OR_PAGEMASK and LOTAR_Y_OR_PAGEOFFSET fields.
18:17  MEMORY_ATTRIBUTE           0      Describes how accesses to memory via this translation are cached, if at all.
                                         Value  Name         Meaning
                                         0      COHERENT     Data is cached locally; loads and stores target the home cache upon a miss in the local cache, and the home cache invalidates the local cache if the data is changed in the L3.
                                         1      NONCOHERENT  Data is cached locally; loads and stores target the home cache upon a miss in the local cache, but the home cache does not invalidate the local cache if the data is changed in the L3.
                                         2      UNCACHEABLE  The data is never cached locally; loads and stores always target the memory controller.
                                         3      MMIO         The data is never cached locally; loads and stores target an I/O device whose address is given by the LOTAR_X_OR_PAGEMASK and LOTAR_Y_OR_PAGEOFFSET fields.
16:9   ASID                       0      Address Space Identifier
8      G                          0      Global
7:4    PS                         0      Page Size
                                         Value  Name       Meaning
                                         0      4K_PAGE    Page size: 4K-byte
                                         1      16K_PAGE   Page size: 16K-byte
                                         2      64K_PAGE   Page size: 64K-byte
                                         3      256K_PAGE  Page size: 256K-byte
                                         4      1M_PAGE    Page size: 1M-byte
                                         5      4M_PAGE    Page size: 4M-byte
                                         6      16M_PAGE   Page size: 16M-byte
                                         7      64M_PAGE   Page size: 64M-byte
                                         8      256M_PAGE  Page size: 256M-byte
                                         9      1G_PAGE    Page size: 1G-byte
                                         10     4G_PAGE    Page size: 4G-byte
                                         11     16G_PAGE   Page size: 16G-byte
                                         12     64G_PAGE   Page size: 64G-byte
3:2    MPL                        0      Minimum Protection Level
1      W                          0      Writable
0      V                          0      Valid
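The bit positions in the ITLB_CURRENT_ATTR bit descriptions above can be decoded with a generic field extractor. This is an illustrative sketch: the macro and function names are not taken from the MDE headers. The page-size helper relies on the pattern visible in the PS encoding, where each step multiplies the size by four starting from 4K-byte.

```c
#include <stdint.h>

/* Extract bits [hi:lo] of a 64-bit register value (field width < 64). */
static uint64_t attr_field(uint64_t attr, unsigned hi, unsigned lo)
{
    return (attr >> lo) & ((1ULL << (hi - lo + 1)) - 1);
}

/* Field accessors using the bit positions from the descriptions above. */
#define ATTR_V(a)          attr_field((a), 0, 0)
#define ATTR_W(a)          attr_field((a), 1, 1)
#define ATTR_MPL(a)        attr_field((a), 3, 2)
#define ATTR_PS(a)         attr_field((a), 7, 4)
#define ATTR_G(a)          attr_field((a), 8, 8)
#define ATTR_ASID(a)       attr_field((a), 16, 9)
#define ATTR_MEM_ATTR(a)   attr_field((a), 18, 17)
#define ATTR_HOME_MAP(a)   attr_field((a), 20, 19)
#define ATTR_LOCATION_Y(a) attr_field((a), 36, 26)
#define ATTR_LOCATION_X(a) attr_field((a), 47, 37)

enum memory_attribute {
    MEM_COHERENT = 0, MEM_NONCOHERENT = 1, MEM_UNCACHEABLE = 2, MEM_MMIO = 3
};

/* PS encoding: 0 = 4K-byte, and each increment multiplies the size by 4,
 * up to 12 = 64G-byte. Returns 0 for encodings outside the listed range. */
static uint64_t page_size_bytes(unsigned ps)
{
    if (ps > 12)
        return 0;
    return 4096ULL << (2 * ps);
}
```

For an MMIO mapping, for example, software would expect ATTR_MEM_ATTR to equal MEM_MMIO and would read the target device coordinates from ATTR_LOCATION_X and ATTR_LOCATION_Y.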
Data TLB Entry VA Register (DTLB_CURRENT_VA)
This register is used to read and write the virtual address of the main processor data TLB Entry.
Speed: Fast
Minimum Protection Level: DTLB_MISS
Figure 2-8: DTLB_CURRENT_VA Register Diagram
Table 2-9. DTLB_CURRENT_VA Register Bit Descriptions
Bits   Name      Reset  Description
63:12  VPN       0      Virtual Page Number.
11:0   Reserved         Reserved
Data TLB Entry PA Register (DTLB_CURRENT_PA)
This register is used to read and write the physical address of the main processor data TLB Entry.
Speed: Fast
Minimum Protection Level: DTLB_MISS
Figure 2-9: DTLB_CURRENT_PA Register Diagram
Table 2-10. DTLB_CURRENT_PA Register Bit Descriptions
Bits   Name      Reset  Description
63:40  Reserved         Reserved
39:12  PFN       0      Physical Frame Number.
11:0   Reserved         Reserved
Data TLB Entry Attribute Register (DTLB_CURRENT_ATTR)
This register is used to read and write the processor data TLB Entry attribute. Writing this register
triggers the write of the DTLB.
Speed: Fast
Minimum Protection Level: DTLB_MISS
Figure 2-10: DTLB_CURRENT_ATTR Register Diagram
Table 2-11. DTLB_CURRENT_ATTR Register Bit Descriptions
Bits   Name                       Reset  Description
63:48  Reserved                          Reserved
47:37  LOCATION_X_OR_PAGE_MASK    0      Location Override Target X field for MMIO page and the non-hash-for-home Coherent or NonCoherent pages; Page mask field for hash-for-home Coherent or NonCoherent pages.
36:26  LOCATION_Y_OR_PAGE_OFFSET  0      Location Override Target Y field for MMIO page and the non-hash-for-home Coherent or NonCoherent pages; Page offset field for hash-for-home Coherent or NonCoherent pages.
25     CACHE_PREFETCH             0      Cache Page Prefetch Attribute: Hardware prefetcher may generate prefetches. The TILE-Gx does not implement the page-prefetch hardware, and the attribute is reserved for a future implementation.
24     Reserved                          Reserved
23     PIN                        0      PIN
22     ADAPTIVE_ALLOCATION               Adaptive Allocation
21     No_L1D_ALLOCATION                 No_L1D_Allocation
20:19  CACHE_HOME_MAPPING                Cache Home Mapping
18:17  MEMORY_ATTRIBUTE                  Memory Attribute
16:9   ASID                              Address Space Identifier
8      G                                 Global
7:4    PS                                Page Size
                                         Value  Name       Meaning
                                         0      4K_PAGE    Page size: 4K-byte
                                         1      16K_PAGE   Page size: 16K-byte
                                         2      64K_PAGE   Page size: 64K-byte
                                         3      256K_PAGE  Page size: 256K-byte
                                         4      1M_PAGE    Page size: 1M-byte
                                         5      4M_PAGE    Page size: 4M-byte
                                         6      16M_PAGE   Page size: 16M-byte
                                         7      64M_PAGE   Page size: 64M-byte
                                         8      256M_PAGE  Page size: 256M-byte
                                         9      1G_PAGE    Page size: 1G-byte
                                         10     4G_PAGE    Page size: 4G-byte
                                         11     16G_PAGE   Page size: 16G-byte
                                         12     64G_PAGE   Page size: 64G-byte
3:2    MPL                               Minimum Protection Level
1      W                          0      Writable
0      V                          0      Valid
Data TLB Index Register (DTLB_INDEX)
This register is used to specify which data TLB entry is read and written by the DTLB_CURRENT_x
registers. Writing a 1 into the top bit of this register forces a read of the indexed TLB entry into the DTLB_CURRENT_x
registers.
Several aspects of TLB read/write behavior bear mentioning:
• Writing TLB_CURRENT_ATTR is the trigger for writing the entire TLB entry specified by the TLB_CURRENT_x registers back to the actual TLB.
• After setting a TLB index register, a DRAIN instruction must be issued before the referenced TLB entry is readable from the TLB_CURRENT_x registers.
Speed: Fast
Minimum Protection Level: DTLB_MISS
Figure 2-11: DTLB_INDEX Register Diagram
Table 2-12. DTLB_INDEX Register Bit Descriptions
Bits   Name      Reset  Description
63     R         0      Read
62     L                Load from REPLACEMENT_DTLB. Reads as zero.
61:12  Reserved         Reserved
11:0   INDEX     0      Index. TILE-Gx implements the bitfield 4:0; writes to bits 11:5 are ignored, and these bits are read as 0.
Instruction TLB Index Register (ITLB_INDEX)
This register is used to specify which instruction TLB entry is read and written by the ITLB_CURRENT_x registers. Writing a 1 into the top bit of this register forces a read of the indexed TLB into
the ITLB_CURRENT_x registers to occur.
Several aspects of TLB read/write behavior bear mentioning:
• Writing TLB_CURRENT_ATTR is the trigger for writing the entire TLB entry specified by the TLB_CURRENT_x registers back to the actual TLB.
• After setting a TLB index register, a DRAIN instruction must be issued before the referenced TLB entry is readable from the TLB_CURRENT_x registers.
Speed: Fast
Minimum Protection Level: ITLB_MISS
Figure 2-12: ITLB_INDEX Register Diagram
Table 2-13. ITLB_INDEX Register Bit Descriptions
Bits   Name      Reset  Description
63     R         0      Read
62     L         0      Load from REPLACEMENT_ITLB. Reads as zero.
61:12  Reserved         Reserved
11:0   INDEX     0      Index. TILE-Gx implements the bitfield 3:0; writes to bits 11:4 are ignored, and these bits are read as 0.
2.3.1.1 TLB Miss Handling
When any access occurs to an address that is not in the TLB, a TLB Miss occurs. When a write
access occurs to an address that is in the TLB, but is not marked Writable, a TLB Access Violation
occurs. In either case, the faulting address is loaded into the TLB’s bad address SPR: DTLB_BAD_ADDR or EX_CONTEXT_x. A TLB Miss or TLB Access Violation is then signaled to notify
software of the event. The DTLB_BAD_ADDR_REASON SPR indicates the reason for a DTLB miss or
access violation. Software is responsible for taking whatever action is required: filling in the missing TLB entry, paging in data from disk, terminating a process that has tried to write to a
read-only address, and so forth.
System software is responsible for ensuring that there are never multiple DTLB entries that match
on a translation. Multiple matches will cause a memory error to be signaled.
Several scenarios exist where special precautions must be taken to ensure that multiple matching
DTLB entries are not installed into the TLB:
•
Software inserts a translation for page P, with the global bit set, and there already exists a translation for page P for a particular ASID.
•
Software inserts an overlapping page due to differing page sizes
To avoid multiple matching DTLB entries, system software can use the data TLB probe instruction (DTLBPR). This instruction takes a register containing a virtual address as its source operand
and it checks the DTLB for any entries that match this virtual address. The ASID field is ignored
when the lookup is performed. The data CPL is used. The match result is written into the DTLB_MATCH_0 SPR. The SPR will contain a 1 in each bit position corresponding to the DTLB entry that
matched the virtual address.
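The probe result can be pictured with a small software model. The sketch below is only an illustration: the `DtlbEntry` fields, the page sizes, and the helper names are assumptions made for this example and do not describe the hardware entry format. It shows how an address covered by two overlapping pages of different sizes yields a match mask with more than one bit set, which is the condition system software must avoid.

```python
from dataclasses import dataclass


@dataclass
class DtlbEntry:
    valid: bool
    base: int   # virtual base address of the mapped page (hypothetical field)
    size: int   # page size in bytes (hypothetical field)


def dtlb_probe(entries, vaddr):
    """Model of a probe: bit i of the result is 1 if entry i matches vaddr.

    The ASID is deliberately not consulted, mirroring the text above.
    """
    match = 0
    for index, entry in enumerate(entries):
        if entry.valid and entry.base <= vaddr < entry.base + entry.size:
            match |= 1 << index
    return match


def has_multiple_matches(match):
    """True if more than one bit is set in a probe result."""
    return match & (match - 1) != 0


entries = [
    DtlbEntry(valid=True, base=0x10000, size=0x10000),    # small page
    DtlbEntry(valid=False, base=0x10000, size=0x10000),   # invalid, never matches
    DtlbEntry(valid=True, base=0x0, size=0x1000000),      # large page overlapping entry 0
]
result = dtlb_probe(entries, 0x10040)
print(bin(result), has_multiple_matches(result))
```

In this model, software would probe before installing a new entry and remove or replace any entry whose bit is set in the result.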
The DTLB probe instruction is also useful when upgrading a “read only” page to a “writable”
page.
Read and write access to the TLBs must be protected to prevent invalid TLB entries from being
added. In order to accomplish this, the MPLs for the respective TLB Miss interrupts are used to
determine what protection level is required to read or write the TLB entries for a particular TLB.
If a TLB resource is accessed without sufficient privileges, a General Protection Violation interrupt is signaled. The General Protection Violation occurs at the minimum protection level of the
faulting resource. When a GP Violation is signaled, the GPV_REASON SPR is filled in with the
access violation cause.
General Protection Violation Reason Register (GPV_REASON)
Contains the reason that a GPV has occurred.
Speed: Slow
Minimum Protection Level: GPV
[Register diagram; fields shown: SPR_INDEX, Reserved, MF_ERROR, MT_ERROR, IRET_ERROR, Reserved]
Figure 2-13: GPV_REASON Register Diagram
Table 2-14. GPV_REASON Register Bit Descriptions
Bits | Name | Reset | Description
63:32 | Reserved | | Reserved
31 | IRET_ERROR | 0 | Set if there was an IRET violation.
30 | MT_ERROR | 0 | Set if there was a move-to-SPR access violation.
29 | MF_ERROR | 0 | Set if there was a move-from-SPR access violation.
28:14 | Reserved | | Reserved
13:0 | SPR_INDEX | 0 | The index of the SPR involved in the access violation.
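As an illustration of the bit layout in Table 2-14, the following sketch splits a raw GPV_REASON value into its fields. The function name and the sample value are made up for this example; only the bit positions come from the table.

```python
def decode_gpv_reason(value):
    """Split a GPV_REASON value into the fields listed in Table 2-14."""
    return {
        "IRET_ERROR": (value >> 31) & 1,   # bit 31
        "MT_ERROR": (value >> 30) & 1,     # bit 30
        "MF_ERROR": (value >> 29) & 1,     # bit 29
        "SPR_INDEX": value & 0x3FFF,       # bits 13:0
    }


sample = (1 << 30) | 0x1234   # hypothetical value: move-to-SPR violation
fields = decode_gpv_reason(sample)
print(fields)
```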
2.4 Memory Consistency Model
2.4.1 Overview
The Tile Processor architecture’s memory consistency model specifies the order in which memory
operations from a processor become visible to other processors in the coherence domain.
There are two main properties, P1 and P2, defined by the memory consistency model: instruction
reordering rules and store atomicity. The Tile Processor architecture defines a relaxed memory
consistency model in which:
P1: Instruction Reordering
Non-overlapping memory accesses from a given processor that reference shared pages can be
reordered and can become visible to other processors sharing that page in an order different from
the original program order, with the following restrictions:
• Data dependencies through memory accesses from a single processor are enforced (RAW, WAW, and WAR).
• Data dependencies through registers or memory determine local visibility order.
• Local ordering established by memory data dependencies or register dependencies does not determine global visibility order. Data writes (including atomic operations and flushes) must observe control dependencies.
P2: Store Atomicity
Stores performed by a processor appear to become visible simultaneously to all remote processors, but can become visible to the issuing processor before becoming globally visible (for
example, by bypassing to a subsequent load through a write buffer). Atomic operations are atomic
to all processors: bypassing to or from atomic operations is not allowed.
The Tile Processor architecture provides the memory fence (MF) instruction to establish ordering
among otherwise unordered instructions when such ordering is needed for correctness. Data
memory operations in the program prior to the memory fence instruction are made globally visible before ANY operation after the memory fence.
The Tile Processor architecture provides the FetchAdd, CmpXchg, FetchAddGez, Xchg, FetchOr,
and FetchAnd operations to read and write a memory location atomically.
The following code sequences illustrate the properties of the tile memory consistency model. In the examples that follow, memory addresses are denoted by x and y, are word aligned, and are assumed to contain the value 0 initially. All loads and stores are word-sized. The notation A → B indicates that operation A becomes visible to all processors in the coherence domain before operation B becomes visible. Listing 2-1 through Listing 2-5 below illustrate property P1 (instruction reordering). Listing 2-6 through Listing 2-8 illustrate property P2 (store atomicity and write bypassing).
Listing 2-1. Property P1: Instruction Reordering. Stores can reorder with stores to different locations and loads can reorder with loads to different locations.
Tile 0                  | Tile 1
sw [x] = 1              | lw r1 = [y]
sw [y] = 1              | lw r2 = [x]
All outcomes for r1 and r2 are possible.
The stores can be made visible in any order. Implementations are free to reorder data memory
operations to different locations. Program order does not imply visibility order.
Listing 2-2. Property P1: Instruction Reordering. Ordering is enforced through the memory fence instruction.
Tile 0                  | Tile 1
sw [x] = 1   // M1      | lw r1 = [y]  // M4
MF           // M2      | MF           // M5
sw [y] = 1   // M3      | lw r2 = [x]  // M6
The only illegal outcome is r1 == 1 and r2 == 0.
Notice that this example is the same as in Listing 2-1, except that here we have an MF instruction inserted between the pair of stores on Tile 0 and also between the pair of loads on Tile 1. The use of the MF instruction ensures that M1 → M3 and M4 → M6. Therefore, if M3 is visible to M4, then M1 is visible to M6.
Listing 2-3. Property P1: Instruction Reordering. Loads can reorder with stores to different locations.
Tile 0                  | Tile 1
sw [x] = 1   // M1      | sw [y] = 1   // M3
lw r1 = [y]  // M2      | lw r2 = [x]  // M4
This example is similar to Listing 2-1, in that the loads and stores on each tile have no dependence and can be freely reordered. All outcomes are legal.
Listing 2-4. Property P1: Instruction Reordering. Preventing loads from passing stores to different locations.
Tile 0                  | Tile 1
sw [x] = 1   // M1      | sw [y] = 1   // M3
MF                      | MF
lw r1 = [y]  // M2      | lw r2 = [x]  // M4
The only illegal outcome is r1 == r2 == 0.
This example is similar to the one shown in Listing 2-3, except we now have MF instructions between the memory operations. The MF on Tile 0 causes M1 → M2, and the MF on Tile 1 causes M3 → M4. Therefore:
If r1 == 0, we have M2 → M3, so we have M1 → M2 → M3 → M4, so r2 == 1.
If r2 == 0, we have M4 → M1, so we have M3 → M4 → M1 → M2, so r1 == 1.
If r1 == 1, we have M3 → M2, but M4 is not ordered with M1, so r2 == 0 OR r2 == 1.
If r2 == 1, we have M1 → M4, but M2 is not ordered with M3, so r1 == 0 OR r1 == 1.
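The outcomes claimed for Listing 2-1 through Listing 2-4 can be checked by brute force. The sketch below is a deliberately simplified model and not the architecture's formal definition: it assumes each tile may reorder its own operations unless an MF or a shared address separates them, and that every operation then takes effect atomically on a single shared memory. It does not model write buffers or control dependencies, so it cannot reproduce the P2 examples. The operation encoding and all function names are assumptions made for this example.

```python
from itertools import permutations


def local_orders(ops):
    """Orders of one tile's ops that keep MF and same-address pairs in program order."""
    orders = []
    for perm in permutations(range(len(ops))):
        position = {op_index: slot for slot, op_index in enumerate(perm)}
        legal = True
        for i in range(len(ops)):
            for j in range(i + 1, len(ops)):
                fenced = ops[i][0] == "mf" or ops[j][0] == "mf"
                same_address = ops[i][1] == ops[j][1]
                if (fenced or same_address) and position[i] > position[j]:
                    legal = False
        if legal:
            orders.append([ops[k] for k in perm])
    return orders


def interleavings(a, b):
    """All merges of sequences a and b that preserve each sequence's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(tile0, tile1):
    """Set of (r1, r2) results reachable in this simplified model."""
    results = set()
    for order0 in local_orders(tile0):
        for order1 in local_orders(tile1):
            for trace in interleavings(order0, order1):
                memory = {"x": 0, "y": 0}
                regs = {}
                for kind, address, arg in trace:
                    if kind == "sw":
                        memory[address] = arg
                    elif kind == "lw":
                        regs[arg] = memory[address]
                results.add((regs["r1"], regs["r2"]))
    return results


MF = ("mf", None, None)
listing_2_2 = ([("sw", "x", 1), MF, ("sw", "y", 1)],
               [("lw", "y", "r1"), MF, ("lw", "x", "r2")])
listing_2_4 = ([("sw", "x", 1), MF, ("lw", "y", "r1")],
               [("sw", "y", 1), MF, ("lw", "x", "r2")])
print(sorted(outcomes(*listing_2_2)))
print(sorted(outcomes(*listing_2_4)))
```

Removing the MF entries from either listing restores all four (r1, r2) combinations, which matches the statements made for Listing 2-1 and Listing 2-3.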
Listing 2-5. Property P1: Instruction Reordering.
Tile 0                  | Tile 1
sw [x] = 1   // M1      | lw r2 = [y]  // M4
MF           // M2      | bbs r5, foo
sw [y] = 1   // M3      | lw r3 = [x]  // M6
Here, r2 == 1, r3 == 0 is a legal outcome. M6 is dependent on the branch; however, the branch is
not dependent on M4. Therefore, there is no dependency between M4 and M6 and they can be
reordered. Specifically, M4 may miss in the cache. While the miss is outstanding, the branch and
M6 both execute, and M6 hits in the cache, writing r3 == 0. Then, the stores on Tile 0 execute and
M4 gets the new value of y (1).
Listing 2-6. Property P2: Store Atomicity and Write Bypassing. Local data dependencies do not establish global visibility ordering: processors can see their own writes early.
Tile 0                  | Tile 1
sw [x] = 1   // M1      | lw r2 = [y]  // M4
lw r1 = [x]  // M2      | MF           // M5
sw [y] = r1  // M3      | lw r3 = [x]  // M6
The following is a legal outcome: r1 == r2 == 1, r3 == 0.
In this case, true data dependencies on Tile 0 cause M1, M2, and M3 to EXECUTE on Tile 0 in
order. However, this does not imply that they become globally visible to Tile 1 in this order.
The above outcome could occur if Tile 0 bypassed the sw to x to the lw x through a write buffer or
local cache. Now, operation M3 writes memory, and operation M4 observes the write M3, but
operation M6 gets to memory before operation M1 has become globally visible. To avoid the local
bypass, Tile 0 should issue an MF instruction between M1 and M2. This forces M1 to become globally visible before M3.
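One way to see how this outcome arises is to replay it against a toy write buffer. The class below is an illustrative assumption and not a description of the real microarchitecture: it holds stores privately until they are drained and lets the issuing tile read its own pending stores. The trace drains the store to y before the store to x, which P1 permits for stores to different locations.

```python
class BufferedTile:
    """Tile with a private write buffer (illustrative, not the real microarchitecture)."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}   # address -> value not yet globally visible

    def store(self, address, value):
        self.write_buffer[address] = value

    def load(self, address):
        # A load sees the tile's own pending store before any other tile does.
        if address in self.write_buffer:
            return self.write_buffer[address]
        return self.memory[address]

    def drain(self, address):
        """Make one pending store globally visible."""
        self.memory[address] = self.write_buffer.pop(address)


memory = {"x": 0, "y": 0}
tile0 = BufferedTile(memory)

tile0.store("x", 1)        # M1, held in the write buffer
r1 = tile0.load("x")       # M2, bypassed from the buffer
tile0.store("y", r1)       # M3
tile0.drain("y")           # M3 becomes globally visible first
r2 = memory["y"]           # M4 on Tile 1
r3 = memory["x"]           # M6 on Tile 1 (after its MF), M1 still buffered
tile0.drain("x")           # M1 becomes globally visible last
print(r1, r2, r3)
```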
Listing 2-7. Property P2: Store Atomicity and Write Bypassing. Local data dependencies establish local ordering.
Tile 0                  | Tile 1
sw [x] = 1   // M1      | lw r1 = [y]   // M4
MF           // M2      | lw r2 = [r1]  // M5
sw [y] = x   // M3      |
r1 == x and r2 == 0 is an illegal outcome.
M5 is data dependent on M4 and thus executes (and becomes locally visible) after M4.
Listing 2-8. Property P2: Store Atomicity and Write Bypassing. Stores have a single order as observed by remote processors.
Tile 0               | Tile 1                | Tile 2               | Tile 3
sw [x] = 1  // M1    | lw r1 = [x]  // M2    | sw [y] = 1  // M4    | lw r3 = [y]  // M5
                     | MF                    |                      | MF
                     | lw r2 = [y]  // M3    |                      | lw r4 = [x]  // M6
r1 == 1, r3 == 1, r2 == 0, r4 == 0 is an illegal outcome.
If the above outcome were legal, this would imply that Tile 3 observes M4 occurring before M1 and Tile 1 observes M1 occurring before M4. More formally, Tile 1 observes M1 → M2 → M3 → M4, while Tile 3 observes M4 → M5 → M6 → M1. Because property P2 of the consistency model requires that a store from a given processor occur atomically as observed by remote processors, the above outcome is illegal.
2.5 TILE-Gx Page Attribute Transitions and Cache Flushes
There are two usage models for page flushing:
• Page attribute transitions, for example changing which cache is the home for a particular page of memory.
• User-managed shared memory
For TILE-Gx, pages can be flushed either via cache displacement flushes or by targeted page-based flushing. The operation of MF on TILE-Gx differs from its operation on TILE64:
• MF no longer ensures that victims are visible
• MF no longer ensures istream operations (demand or prefetch) are visible
2.6 Protection
This section discusses protection levels, also referred to as access or privilege levels. This topic is
introduced at this point since it applies to every Special Purpose Register (SPR) and is critical to
using the SPRs.
2.6.1 Levels of Protection
The Tile architecture contains four levels of protection. The protection levels are a strict hierarchy,
thus code or a hardware mechanism executing at one protection level is afforded all of the privileges of that protection level and all lower protection levels. The protection levels are numbered
0-3 with protection level 0 being the least privileged protection level and 3 being the most privileged protection level. Table 2-15 presents an informal mapping from a protection level number
to names. This specification refrains from formally defining names for the four different protection levels (because other protection schemes different from the example used here are possible)
but informally defines one possible name mapping.
Table 2-15. Informal Protection-Level Name Mapping
Protection Level | Name
0 | User
1 | Supervisor
2 | Hypervisor
3 | Virtual Machine Level
2.6.2 Protected Resources
The Tile architecture contains several categories of protection mechanisms. These protection
mechanisms include:
• Preventing illegal instruction execution
• Preventing instructions from injecting into or reading from selected networks
• Memory protection via multiple translation lookaside buffers (TLBs)
• Negotiated Application Programmer Interfaces (APIs) for physical device multiplexing
• Controlling the protection level to which an interrupt traps
2.7 Interrupt Model
2.7.1 Introduction
Interrupts and exceptions are conditions that cause an unexpected change in control flow of the
currently executing code. Interrupts are asynchronous to the program; exceptions are caused
directly by execution of an instruction.
Interrupts and exceptions interject themselves into executing code. In the Tile architecture™,
there is a wider variety of devices than in a standard processor. The interrupt/exception structure
of the Tile architecture is distributed and tiled in the same manner as the rest of the design. An
interrupt or exception that occurs is only reported to the local tile to which it is relevant. By localizing the reporting, no global structures or communication are needed to process the interrupt. If
a local interrupt needs to be reported to a remote location, it is the operating system’s responsibility to communicate that need via one of the architecture’s inter-tile communication mechanisms.
The interrupt/exception structure of the Tile architecture is tightly integrated with the protection
model. The architecture contains a Minimum Protection Level (MPL) for each possible interrupt
or exception that can occur. The MPL is a dual-use mechanism. It indicates the minimum protection level required to take some action in the processor without generating an exception. It can
also indicate the protection level at which the corresponding interrupt or exception handler
executes. Some exceptions occur regardless of protection level. Examples of these are TLB misses
and illegal instruction exceptions. For exceptions that occur regardless of protection level, if the
current protection level (CPL) is less than the MPL for the corresponding exception, the exception
occurs at the MPL for the exception. If the CPL is greater than or equal to the MPL for the corresponding exception, then the exception is executed at the current protection level. For a complete
list of interrupts and exceptions, see Table 2-16, “Interrupt and Exception List,” on page 34.
The Tile architecture uses a vectored approach to interrupts/exceptions; there are four sets of vectors, one for each protection level. On an interrupt or exception, the architecture changes the
program counter to a value derived from the interrupt/exception number and the protection level
at which the handler is to execute. The offset is the value in the INTERRUPT_VECTOR_BASE
SPR for the protection level, plus the protection level multiplied by 16 MB (0x01000000), plus the
interrupt/exception number multiplied by 256. This allows 32 Very Long Instruction Word
(VLIW) instructions to reside in each vector, and allows all of a protection level’s handlers and up
to 16 MB of accompanying code to be mapped into virtual address space using one large-page
ITLB entry. If more than 32 instructions are needed to handle an interrupt/exception, the code can
jump to the rest of the handler located in that same large page, or anywhere else in the address
space.
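The two rules above (which protection level handles an exception, and where its vector lives) can be written as a short calculation. This is a sketch under stated assumptions: the function and constant names are invented for the example, and a vector base of 0 is used only to keep the arithmetic easy to follow.

```python
VECTOR_BYTES = 256            # spacing between consecutive vectors
LEVEL_STRIDE = 0x01000000     # 16 MB per protection level
BUNDLES_PER_VECTOR = 32       # VLIW bundles that fit in one vector


def handler_protection_level(cpl, mpl):
    """PL at which an exception that occurs regardless of PL is handled."""
    return mpl if cpl < mpl else cpl


def vector_address(vector_base, protection_level, interrupt_number):
    """Handler entry point, following the formula in the text."""
    return (vector_base
            + protection_level * LEVEL_STRIDE
            + interrupt_number * VECTOR_BYTES)


pl = handler_protection_level(cpl=0, mpl=1)
print(hex(vector_address(0, pl, 18)))          # interrupt 18 is DTLB Miss in Table 2-16
print(VECTOR_BYTES // BUNDLES_PER_VECTOR)      # bytes available per bundle
```

The last line shows why 32 bundles fit: 256 bytes per vector divided by 32 bundles leaves 8 bytes for each bundle.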
2.7.1.1 Interrupt/Exception State
When an interrupt or exception occurs, location information (identifying where the interrupted
program is currently executing) must be saved. This feature is designed to allow the return from
the interrupt to the exact location that the interrupt/exception occurred. It allows the handler to
know precisely which instruction caused the exception (if it is caused by a fault in the processor’s
instruction stream). In order to track this location (state) information, hardware in the processor
must save and restore state atomically through a mechanism.
The state that needs to be saved and restored is the program counter of the interrupted process and the protection level of the interrupted process. The program counter, protection level, and INTERRUPT_CRITICAL_SECTION status register of the interrupted process are stored in an exceptional context, abbreviated as EX_CONTEXT. The EX_CONTEXT for a given protection level is stored in a pair of Special Purpose Registers (SPRs), with a separate register pair dedicated to that protection level. On interrupt/exception, the interrupted process’s state and protection level are stored in the EX_CONTEXT that corresponds to the protection level at which the handler is to be executed. This state is likewise used to return via the IRET instruction. On a return from interrupt/exception, the content of the current protection level’s EX_CONTEXT is copied into the current machine’s program counter and state.
2.7.1.2 Nested Interrupts/Exceptions
The Tile architecture takes the approach of pushing the bulk of the interrupt/exception processing into software. By doing this, an interrupt or exception can occur within an already-executing
handler. This introduces a set of problems when an interrupt or exception occurs within a handler
at the same protection level. Primarily, the EX_CONTEXT has no place to be saved.
An example illustrates the problem with nested interrupts. Process A is executing at protection level 0 (the lowest protection level). An interrupt of protection level 1 occurs, which saves the program counter and protection level 0 of process A into the EX_CONTEXT state of protection level 1. Interrupt handler B now begins to execute to resolve the interrupt. While executing at protection level 1, handler B takes some action that causes an exception to occur at protection level 1, the same protection level at which interrupt handler B is executing. The hardware now replaces the EX_CONTEXT with protection level 1 and the program counter of interrupt handler B while it begins to execute exception handler C. Unfortunately, when exception handler C returns, control returns to interrupt handler B, but the content of EX_CONTEXT is not the same as when the control flow left interrupt handler B. In effect, the context information of process A was never saved into reliable storage and is irrecoverably lost. To combat this problem, the Tile architecture uses the INTERRUPT_CRITICAL_SECTION status bit.
The INTERRUPT_CRITICAL_SECTION status bit indicates if a process is in the middle of a critical
section of a handler, that is any time the handler is prevented from processing another interrupt
or exception. When an interrupt/exception occurs, the INTERRUPT_CRITICAL_SECTION bit is set,
which indicates that the handler cannot be interrupted. If a non-masked interrupt occurs at the same
protection level as the current handler while the INTERRUPT_CRITICAL_SECTION bit is set, a double
fault interrupt is taken instead. If the MPL of the double fault interrupt is greater than the CPL,
the current system state is saved in the target PL’s EX_CONTEXT. This is true for any interrupt.
However, if the MPL of the double fault interrupt is less than or equal to the CPL, then the double
fault handler executes at the CPL, and its EX_CONTEXT is not modified. In either case, the double
fault handler will most likely dump the state of the machine and halt. It may also ask a higher
level supervisory layer to dump its state and halt the currently running process.
For more information on double faults, see “Double Faults” on page 39.
It is the responsibility of the writer of an interrupt/exception handler to see that no exceptions
can occur inside its critical section. One of the key implications of this is that memory accesses
while in the critical section of the interrupt handler must not cause a TLB miss. This is actually
only a problem if the TLB miss handler is executing at the current protection level (PL).
Once the EX_CONTEXT is stored in memory, typically in some kernel stack structure or per interrupt state, the processor can unset the INTERRUPT_CRITICAL_SECTION and handle interrupts/
exceptions normally. It is expected that long-running handlers will quickly store away the EX_CONTEXT state and then deassert the INTERRUPT_CRITICAL_SECTION bit so they can use mapped
memory and allow other interrupts/exceptions to occur normally. Under normal circumstances,
the INTERRUPT_CRITICAL_SECTION bit will be reasserted shortly before execution of IRET so that
the handler can safely restore the EX_CONTEXT.
If an interrupt or exception occurs at a higher protection level, regardless of the INTERRUPT_CRITICAL_SECTION status bit, the higher level’s EX_CONTEXT is filled in with the current PC, protection
level, and value of INTERRUPT_CRITICAL_SECTION status bit of the interrupted handler. No state
information is lost because of the per protection level EX_CONTEXTs.
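The handler discipline described in this section can be traced with a small state model. It is a sketch only: the `Tile` class, its method names, and the program counter values are assumptions, exceptions are the only events modeled, and a Python exception stands in for the double fault interrupt. The model saves (PC, PL, ICS) into the target level's EX_CONTEXT on entry, restores it on IRET, and refuses a same-level exception while the critical section bit is set.

```python
class DoubleFault(Exception):
    """Stand-in for the double fault interrupt."""


class Tile:
    def __init__(self):
        self.pc, self.cpl, self.ics = 0, 0, 0
        self.ex_context = {pl: None for pl in range(4)}

    def take(self, handler_pc, mpl):
        """Enter a handler for an exception whose MPL is mpl."""
        target = max(self.cpl, mpl)
        if target == self.cpl and self.ics:
            raise DoubleFault("EX_CONTEXT for level %d is still unsaved" % target)
        self.ex_context[target] = (self.pc, self.cpl, self.ics)
        self.pc, self.cpl, self.ics = handler_pc, target, 1

    def iret(self):
        """Return using the current level's EX_CONTEXT."""
        self.pc, self.cpl, self.ics = self.ex_context[self.cpl]


tile = Tile()
tile.pc = 0x1000               # process A at protection level 0
tile.take(0x8000, mpl=1)       # handler B starts at level 1, ICS set
saved = tile.ex_context[1]     # B stores EX_CONTEXT away (for example on a kernel stack)
tile.ics = 0                   # B leaves its critical section
tile.pc = 0x8040
tile.take(0x9000, mpl=1)       # nested exception: handler C at level 1
tile.iret()                    # C returns to B
tile.ics = 1                   # B re-enters the critical section
tile.ex_context[1] = saved     # B restores the context of process A
tile.iret()                    # B returns to A
print(hex(tile.pc), tile.cpl, tile.ics)
```

Calling `take` a second time before clearing `ics` raises `DoubleFault` in this model and leaves the saved context of process A untouched.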
2.7.1.3 Interrupt Traits
The Tile architecture refers to all types of exceptions, traps, and interrupts as interrupts. As
described above, interrupts and exceptions are similar, but can have different traits. Exceptions
are caused by instructions and are synchronous to the program. Examples are an illegal instruction or a load that takes a DTLB miss. Interrupts are caused by events outside of the program flow and are asynchronous. Examples are a timer interval or an I/O device completion notification (via
Inter-Processor interrupt). It is the responsibility of an implementation to ensure timely delivery
of asynchronous interrupts, but it is up to the discretion of the implementation to deliver the
interrupt at a convenient time as long as it guarantees that it will not be forever delayed.
When entering INTERRUPT_CRITICAL_SECTION mode on an interrupt or exception, it is desirable
to mask some interrupts automatically until the processor exits the critical section. To accomplish
this goal, interrupts are automatically masked when the INTERRUPT_CRITICAL_SECTION status bit
is set. Note that being in the critical section at a particular protection level does not mask an interrupt from occurring if its MPL is higher than the CPL. This allows a higher level operating system
to handle a lower level’s fault even within a critical section of a lower level’s interrupt handling
routine.
Interrupts can be masked by bits set in the INTERRUPT_MASK_REGISTERs.
2.7.1.4 Interrupt Masks
The INTERRUPT_MASK_X registers consist of a special purpose register per protection level that
controls the masking of the system’s interrupts. Each bit in the mask registers corresponds to a particular interrupt.
An interrupt is masked off by setting a corresponding bit in the INTERRUPT_MASK_X registers to 0
where X is the desired protection level (0 through 3). An interrupt is unmasked if the mask bit is
set to a 1.
When a process is in an interrupt critical section as denoted by the INTERRUPT_CRITICAL_SECTION
status bit, additional interrupts may be masked. If the MPL for the interrupt is less than or equal
to the current protection level, then the interrupt is masked when the INTERRUPT_CRITICAL_SECTION status bit is set.
Figure 2-14 presents how the protection level, interrupt masking, and an interrupt critical section
come together to signal whether an interrupt occurs. This figure presents the path for a single interrupt, where “I” indicates whether the item is a real interrupt or simply an interrupt number used to indicate information about the protection domain (such as WORLD_ACCESS, BOOT_ACCESS, and so on). This flow is duplicated for each interrupt. A priority encoder determines the highest-priority interrupt occurring on a given cycle.
Interrupts that are masked for any reason are delivered when they are unmasked. To clear an
interrupt, some action specific to that interrupt must be taken.
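The per-interrupt decision and the priority selection can be sketched as two small functions. The names, the boolean `masked_by_register` (which abstracts the INTERRUPT_MASK_X lookup without encoding a bit polarity), and the MPL values in the sample are all assumptions made for illustration. The rule that the lowest interrupt number has the highest priority is taken from Section 2.7.2.

```python
def is_signaled(pending, masked_by_register, ics, mpl, cpl):
    """Whether one pending interrupt is signaled, per the masking rules above.

    masked_by_register abstracts the INTERRUPT_MASK_X lookup as a boolean.
    """
    if not pending or masked_by_register:
        return False
    if ics and mpl <= cpl:
        return False   # implicitly masked inside the critical section
    return True


def highest_priority(signaled_numbers):
    """Lowest interrupt number wins; None if nothing is signaled."""
    return min(signaled_numbers, default=None)


cpl, ics = 1, True
candidates = {
    22: dict(pending=True, masked_by_register=False, mpl=1),   # Tile Timer
    25: dict(pending=True, masked_by_register=False, mpl=2),   # UDN Timer
    32: dict(pending=True, masked_by_register=True, mpl=2),    # Performance Counters
}
signaled = [n for n, c in candidates.items() if is_signaled(ics=ics, cpl=cpl, **c)]
print(signaled, highest_priority(signaled))
```

Interrupts that are filtered out here stay pending in the model and would be reconsidered once the mask or the critical section bit changes.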
2.7.1.5 INTCTRL and Protection of Interrupt Masks
In order to modify the interrupt-related state for protection level X, the executing code’s CPL must be at least the MPL of INTCTRL_X. The Tile architecture contains four INTCTRL MPLs that protect the various interrupt-related SPRs. Namely, for protection level X, INTCTRL_X protects the interrupt
masking, exceptional context, and system save SPRs. More specifically, INTCTRL_X protects SPRs
EX_CONTEXT_X, SYSTEM_SAVE_X_[0,1,2,3], INTERRUPT_MASK_X, INTERRUPT_MASK_SET_X, and
INTERRUPT_MASK_RESET_X. Users can change the protection level needed to access this state to
allow the virtualization of this state and facilitate downcall virtualization (refer to “Downcalls”
on page 41).
The default configuration of INTCTRL_X registers sets their MPLs to X itself. This configuration
allows the interrupt masks to be accessed at the namesake protection level. Setting the INTCTRL_X MPL to a PL numerically lower than X, while architecturally allowed, does not make much sense, as this would allow lower-privileged code to modify the interrupt masks of higher-privileged code. Setting the PL higher enables the virtualization of these SPRs.
[Figure; labels shown: CPL, MPL[i], >=, ICS, Interrupt ‘i’ signaled, CM[i], Interrupt Mask Generation]
Figure 2-14: Interrupt Signal
A more detailed view of the interrupt mask generator is shown in Figure 2-15.
[Figure; labels shown: IM_0[i], IM_1[i], IM_2[i], IM_3[i], CPL, Interrupt Mask for interrupt ‘i’]
Figure 2-15: Interrupt Mask Generator
2.7.1.6 VLIW and Interrupts
The Tilera® Tile architecture is a VLIW architecture. A complete VLIW instruction bundle is
atomically executed or not executed.
2.7.2 Interrupt and Exception List
In Table 2-16, the lowest number indicated in the Interrupt Number field has the highest priority.
Table 2-16. Interrupt and Exception List
Interrupt Number | Name | Interrupt or Exception? | Short Name
0 | Memory Error | I | MEM_ERROR
1 | Single Step 3 | E | single_step_3
2 | Single Step 2 | E | single_step_2
3 | Single Step 1 | E | single_step_1
4 | Single Step 0 | E | single_step_0
5 | IDN Complete | E | IDN_COMPLETE
6 | UDN Complete | E | UDN_COMPLETE
7 | ITLB Miss | E | ITLB_MISS
8 | Illegal Instruction | E | ILL
9 | General Protection Violation | E | GPV
10 | IDN Access | E | IDN_ACCESS
11 | UDN Access | E | UDN_ACCESS
12 | Software Interrupt 3 | E | SWINT_3
13 | Software Interrupt 2 | E | SWINT_2
14 | Software Interrupt 1 | E | SWINT_1
15 | Software Interrupt 0 | E | SWINT_0
16 | Illegal Translation | E | ill_trans
17 | Unaligned Data | E | UNALIGN_DATA
18 | DTLB Miss | E | DTLB_MISS
19 | DTLB Access Error | E | DTLB_ACCESS
20 | IDN Firewall Violation | I | IDN_FIREWALL
21 | UDN Firewall Violation | I | UDN_FIREWALL
22 | Tile Timer | I | TILE_TIMER
23 | Auxiliary Tile Timer | I | AUX_TILE_TIMER
24 | IDN Timer | I | IDN_TIMER
25 | UDN Timer | I | UDN_TIMER
26 | IDN Available | I | IDN_AVAIL
27 | UDN Available | I | UDN_AVAIL
28 | Interprocessor Interrupt 3 | I | ipi_3
29 | Interprocessor Interrupt 2 | I | ipi_2
30 | Interprocessor Interrupt 1 | I | ipi_1
31 | Interprocessor Interrupt 0 | I | ipi_0
32 | Performance Counters | I | PERF_COUNT
33 | Auxiliary Performance Counters | I | AUX_PERF_COUNT
34 | Interrupt Control 3 | I | INTCTRL_3
35 | Interrupt Control 2 | I | INTCTRL_2
36 | Interrupt Control 1 | I | INTCTRL_1
37 | Interrupt Control 0 | I | INTCTRL_0
38 | Boot Access | | BOOT_ACCESS
39 | World Access | I | WORLD_ACCESS
40 | Instruction ASID | I | I_ASID
41 | Data ASID | I | D_ASID
42 | Double Fault | I | DOUBLE_FAULT
2.7.3 Interrupt State, Control Registers, Double Faults, and IRET
2.7.3.1 Interrupt State and Control Registers
The interrupt state registers are mapped, that is, they maintain specific addresses in memory as
part of the architecture’s special purpose register space.
EX_CONTEXT
The EX_CONTEXT is vital to the interrupt process. EX_CONTEXT is an abbreviation for exceptional
context. An EX_CONTEXT is provided for each of the architecture’s four protection levels. Each
EX_CONTEXT consists of two Special Purpose Registers. The EX_CONTEXT registers are named EX_CONTEXT_0_0, EX_CONTEXT_0_1, EX_CONTEXT_1_0, EX_CONTEXT_1_1, EX_CONTEXT_2_0,
EX_CONTEXT_2_1, EX_CONTEXT_3_0, and EX_CONTEXT_3_1. The first appended number corresponds to the protection level and the second number identifies if it is the first or second word of
the EX_CONTEXT. The first word of an EX_CONTEXT, EX_CONTEXT_X_0 contains the exceptional program counter (PC). The second word of an EX_CONTEXT, EX_CONTEXT_X_1 contains the protection
level (PL) and the exceptional INTERRUPT_CRITICAL_SECTION status bit. The diagrams in
Figure 2-16 through Figure 2-23 show the bit locations of an EX_CONTEXT.
Interrupt Mask Register
The INTERRUPT_MASK_X registers allow a program to mask out interrupts. X indicates the Protection Level number. This set of registers includes: INTERRUPT_MASK_0, INTERRUPT_MASK_1,
INTERRUPT_MASK_2, and INTERRUPT_MASK_3. See “Interrupt Masks” on page 32 for more details
on the control of the Interrupt Mask Register.
Interrupt Critical Section Status Register
The Tile architecture contains a single global INTERRUPT_CRITICAL_SECTION status bit [Bit 0] in
the INTERRUPT_CRITICAL_SECTION register (see Figure 2-26). There are no restrictions on changing this bit. This bit is active high and indicates if the system is in an interrupt critical section (if
Bit 0 is set to 1).
Further, this status register controls whether or not data is copied into an EX_CONTEXT when an
interrupt is signaled. This status bit is set when an interrupt or exception occurs. When a valid
IRET is issued, the state found in EX_CONTEXT_CPL_X.ICS is copied to the INTERRUPT_CRITICAL_SECTION register.
Last Interrupt Reason Register
Double fault handlers need to determine what is happening when a DOUBLE_FAULT interrupt
occurs. In order to make this determination, the TILE architecture supports an SPR that contains
the last two interrupt/exception reasons. The last two interrupt reasons are saved in a single
LAST_INTERRUPT_REASON SPR register (Figure 2-27). The LAST_INTERRUPT_REASON is shifted over
whenever an interrupt occurs, with the new interrupt reason being shifted in.
On a double fault, instead of a double fault being registered in the LAST_INTERRUPT_REASON register, the highest priority interrupting reason that caused the double fault is stored. On reset, the
LAST_INTERRUPT_REASON registers are reset to the double fault interrupt number. This register is
protected by the DOUBLE_FAULT MPL to prevent unauthorized access.
For more information on double faults, see “Double Faults” on page 39.
Note that the following are sample register diagrams. For a complete list of register descriptions,
see Table 8-1, “Special Purpose Registers,” on page 106.
EX_CONTEXT_0_0
[Register diagram; fields: PC, Reserved]
Figure 2-16: EX_CONTEXT_0_0 Register Diagram
EX_CONTEXT_0_1
[Register diagram; fields: PL, ICS, Reserved]
Figure 2-17: EX_CONTEXT_0_1 Register Diagram
EX_CONTEXT_1_0
[Register diagram; fields: PC, Reserved]
Figure 2-18: EX_CONTEXT_1_0 Register Diagram
EX_CONTEXT_1_1
[Register diagram; fields: PL, ICS, Reserved]
Figure 2-19: EX_CONTEXT_1_1 Register Diagram
EX_CONTEXT_2_0
[Register diagram; fields: PC, Reserved]
Figure 2-20: EX_CONTEXT_2_0 Register Diagram
EX_CONTEXT_2_1
[Register diagram; fields: PL, ICS, Reserved]
Figure 2-21: EX_CONTEXT_2_1 Register Diagram
EX_CONTEXT_3_0
[Register diagram; fields: PC, Reserved]
Figure 2-22: EX_CONTEXT_3_0 Register Diagram
EX_CONTEXT_3_1
[Register diagram; fields: PL, ICS, Reserved]
Figure 2-23: EX_CONTEXT_3_1 Register Diagram
INTERRUPT_MASK_X_0
[Register diagram; fields: MASK_ bits (field name suffixes not recoverable), Reserved]
Figure 2-24: INTERRUPT_MASK_0 Register Diagram
INTERRUPT_MASK_X_1
[Register diagram; fields: MASK_ bits (field name suffixes not recoverable), Reserved]
Figure 2-25: INTERRUPT_MASK_X_1 Register Diagram
INTERRUPT_CRITICAL_SECTION
[Register diagram; fields: ICS, Reserved]
Figure 2-26: INTERRUPT_CRITICAL_SECTION State Register Diagram
2.7.3.2 Double Faults
A double fault occurs when an unmasked interrupt or exception occurs at the current protection
level while the INTERRUPT_CRITICAL_SECTION status bit is set. This is, in effect, an interrupt
inside of an interrupt handler critical section. Because the current protection level’s EX_CONTEXT
state has not been saved to memory at this point, the EX_CONTEXT will not be overwritten. Double
faults are typically due to programmer error as critical sections of interrupt handlers should not
take an interrupt or exception. Note that interrupts are implicitly masked when the INTERRUPT_CRITICAL_SECTION status bit is set.
LAST_INTERRUPT_REASON (fields: LAST_REASON, LAST_LAST_REASON, Reserved)
Figure 2-27: LAST_INTERRUPT_REASON State Register Diagram
When a double fault occurs, the LAST_INTERRUPT_REASON register is shifted eight bits to the left
and filled in with the highest priority interrupt/exception causing the double fault. The last two
interrupt/exception reasons are tracked. The double fault interrupt handler can inspect the
LAST_INTERRUPT_REASON register to determine if it is possible to recover from the double fault
and to provide debugging information about where the error occurred.
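The shift-and-fill update described above can be modeled in a few lines of C. This is a minimal sketch rather than the hardware definition: it assumes that each reason occupies eight bits, with the most recent reason in bits [7:0] and the previous reason in bits [15:8]. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout: newest reason in bits [7:0], previous reason in bits [15:8]. */
#define REASON_BITS 8u
#define REASON_MASK 0xFFu

/* Model of the update: shift left eight bits, fill in the new reason,
   and keep only the two tracked reasons. */
static uint64_t record_reason(uint64_t reg, unsigned reason)
{
    return ((reg << REASON_BITS) | (reason & REASON_MASK)) & 0xFFFFu;
}

/* Most recent interrupt/exception reason. */
static unsigned last_reason(uint64_t reg)
{
    return (unsigned)(reg & REASON_MASK);
}

/* Reason recorded before the most recent one. */
static unsigned last_last_reason(uint64_t reg)
{
    return (unsigned)((reg >> REASON_BITS) & REASON_MASK);
}
```

A double fault handler modeled this way would call last_reason() for the fault that triggered it and last_last_reason() for the interrupt that preceded it.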
2.7.3.3 IRET
The IRET instruction is used to signal the end of an interrupt/exception routine. The IRET instruction atomically copies the exception context state to the program counter (PC), current protection
level (CPL), and the INTERRUPT_CRITICAL_SECTION status bit.
An IRET atomically takes the following actions. First, it verifies that the privilege level that the
IRET will be returning to, which is stored in the EX_CONTEXT_CPL_1.PL, is less than or equal to the
current protection level. If not, an IRET_ERROR general protection violation occurs. Next the
machine copies the program counter (PC) from the current protection level’s EX_CONTEXT_CPL_0
into the machine’s PC. Next the machine copies the current protection level’s EX_CONTEXT_CPL_1.ICS status bit into the global INTERRUPT_CRITICAL_SECTION status bit. Lastly, the
current protection level (CPL) is updated to be EX_CONTEXT_CPL_1.PL hence restoring the protection level.
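The sequence of IRET actions can be summarized as a small software model. The sketch below is illustrative only: the structure and function names are hypothetical, and real hardware performs all of the steps atomically.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the state touched by IRET. */
struct tile_state {
    uint64_t pc;                 /* program counter */
    unsigned cpl;                /* current protection level, 0..3 */
    bool     ics;                /* INTERRUPT_CRITICAL_SECTION status bit */
    uint64_t ex_context_pc[4];   /* EX_CONTEXT_n_0: saved PC per level */
    unsigned ex_context_pl[4];   /* EX_CONTEXT_n_1.PL per level */
    bool     ex_context_ics[4];  /* EX_CONTEXT_n_1.ICS per level */
};

/* Returns true on success, false where hardware would raise IRET_ERROR. */
static bool model_iret(struct tile_state *s)
{
    unsigned level  = s->cpl;
    unsigned target = s->ex_context_pl[level];

    if (target > level)
        return false;                    /* target PL must be <= current PL */

    s->pc  = s->ex_context_pc[level];    /* restore the PC */
    s->ics = s->ex_context_ics[level];   /* restore the ICS status bit */
    s->cpl = target;                     /* restore the protection level */
    return true;
}
```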
2.7.4 Interprocessor Interrupt (IPI)
I/O device and Tile-to-Tile interrupts are delivered to Tile software via the Tile Interprocessor
Interrupt (IPI) mechanism. Each Tile has four IPI MPLs, each with 32 interrupt events. I/O interrupts have programmable bindings in their MMIO register space, which specify the target Tile,
interrupt number (also referred to as the IPI Minimum Protection Level or IPI MPL), and event
number. System software can choose to share Tile interrupt event bits among multiple I/O
devices or dedicate the interrupt bits to a single I/O interrupt. Interrupt bits can also be shared
between I/O and Tile-to-Tile interrupts.
I/O devices implement interrupt status and enable bits to allow interrupt sharing and coalescing.
2.7.5 Distributed Interrupt Processing
The interrupt model on the Tile architecture is a distributed model. Any interrupt that gets signaled only occurs on a single tile. Each tile may receive all of the interrupts laid out in this section.
The architecture only provides for local interrupt delivery; thus, if an operating system would like
to deliver an interrupt to a tile on which the interrupt did not occur, some form of software
tile-to-tile communication is needed. This communication can be through a Tile-to-Tile IPI or
through memory.
One example implementation may be where all of the system code executes on one tile. In effect
this is the system tile. Then an interrupt occurs on a tile that is not the system tile. In this example,
the designer wants to slim down the system code that is running on the slave, non-system, tiles.
Unfortunately, there must still be at least a system stub running on the slave tile. This stub
handles the interrupt and communicates the fact that an interrupt occurred on the slave tile to the
system tile. After this communication, the slave tile waits for further instruction from the system
tile. This in effect allows a parallel processing model with one centralized system copy.
2.7.6 Proxying Interrupts
The ring or hierarchical protection model that the Tile architecture provides can be too restrictive
for some protection systems. To allow for more flexible protection systems, an interrupt that was
delivered to a high protection level may need to be proxied down to a lower protection level’s
handler. This is termed a proxied downcall. To accomplish a proxied downcall, the higher level
process copies the EX_CONTEXT state from the current protection level to the EX_CONTEXT for the
protection level that will be receiving the downcall. Next it writes the EX_CONTEXT_CPL state with
the protection level of the downcall and the PC of the downcall’s interrupt vector location, and
sets the INTERRUPT_CRITICAL_SECTION status bit in the EX_CONTEXT_CPL state. Lastly it executes
an IRET, thus mimicking an interrupt entry to the downcalled interrupt vector.
2.7.7 Lower Protection Level Interrupts
Asynchronous interrupts may happen at any time. They may even occur inside of code running at
a higher protection level than the level that is desired to handle the particular interrupt. The
architecture provides two ways of dealing with these interrupts. First, if the higher level system
code does not want to be interrupted by the lower protection level’s interrupt, it may simply
mask the interrupt. The interrupt will be delivered when it is unmasked by restoring the interrupt
mask for the lower protection level. Second, if the higher protection level code would like to
be interrupted by a particular interrupt, it should leave the interrupt unmasked. When the
interrupt arrives, the corresponding interrupt handler for the higher level system code will be
executed. It is the responsibility of this code to proxy the call, if needed, to the lower level
operating system. The INTCTRL interrupt provides a convenient mechanism to deliver these
interrupts when the higher protection level code completes.
2.7.8 Downcalls
While increasing the CPL is the most common way to request a service, in some situations you
might want to decrease the CPL to accomplish a task instead. In effect, this delegates processing
of an interrupt to code running at a lower protection level. For instance, on the Tile architecture, a
hypervisor might want to handle the Double Fault interrupt, to detect a faulty supervisor; however, if that interrupt were generated by an application program, it might want to allow the
supervisor to handle it instead.
In many cases, this is quite easy to do. If an interrupt occurs at CPL 0 that is handled at CPL 2, and
the interrupt service routine then decides that it should be handled at CPL 1 instead, it would perform the following algorithm, which we will term a downcall:
1. First, it reads the contents of exception context 2, and writes it to exception context 1; this will
be the PC of the interrupted code, a CPL of 0, and whatever the Interrupt Critical Section (ICS)
state was at the time of the interrupt.
2. Next, it writes exception context 2 with a PC of the desired interrupt handler in PL 1, a CPL of
1, and an ICS state of true.
3. Lastly, it executes an IRET instruction.
After the return from interrupt, the PL 1 interrupt service routine is executed, and it begins with
the exact same state it would have had if the interrupt had gone to it originally; when it is done, it
returns from the interrupt and the originally-executing code is resumed.
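The three downcall steps can be traced with a small software model. This is a sketch under stated assumptions: the structures and the names model_iret and downcall are hypothetical, and the model ignores interrupt masks and MPL checks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one exception context (PC, PL, ICS). */
struct ex_context { uint64_t pc; unsigned pl; bool ics; };

struct tile_state {
    uint64_t pc;              /* program counter */
    unsigned cpl;             /* current protection level */
    bool     ics;             /* interrupt critical section bit */
    struct ex_context ex[4];  /* one exception context per protection level */
};

/* IRET model: restore PC, ICS, and CPL from the current level's context. */
static bool model_iret(struct tile_state *s)
{
    struct ex_context c = s->ex[s->cpl];
    if (c.pl > s->cpl)
        return false;         /* hardware would raise IRET_ERROR */
    s->pc = c.pc;
    s->ics = c.ics;
    s->cpl = c.pl;
    return true;
}

/* Downcall from the current level to a lower level's handler. */
static bool downcall(struct tile_state *s, unsigned to_pl, uint64_t handler_pc)
{
    unsigned from = s->cpl;
    if (to_pl >= from)
        return false;         /* a downcall must lower the protection level */

    s->ex[to_pl] = s->ex[from];   /* step 1: hand over the interrupted state */
    s->ex[from].pc  = handler_pc; /* step 2: aim the IRET at the handler */
    s->ex[from].pl  = to_pl;
    s->ex[from].ics = true;
    return model_iret(s);         /* step 3: IRET into the lower-level handler */
}
```

In this model, a second model_iret() call made by the PL 1 handler resumes the originally interrupted code, which mirrors the behavior described above.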
If you want to delegate handling of an interrupt to the PL at the point where the interrupted code
was running, it can be more difficult. For instance, a supervisor might get an I/O interrupt, which
it would like to delegate to an application program, since the specific I/O device is owned by that
application. If the interrupted code was not in an interrupt-critical section, and the interrupt of
interest is not masked at the delegatee PL, returning to the lower-level interrupt routine can be
done as described above.
However, if the interrupted code was in an ICS, that procedure would destroy the state in the
lower PL’s exception context. Likewise, if the to-be-delegated interrupt is masked, that procedure
could destroy application state which is being protected by that mask.
There are two ways to deal with this problem. One way is to virtualize the ICS state bit and the
interrupt masks for the lower PL by raising the MPL that controls access to their associated special-purpose registers. The delegating code then returns to the previously-executing code at the
lower PL. When that code accesses the ICS or interrupt mask SPRs, the delegating code’s GPV
interrupt handler can emulate the appropriate instruction. When one of those accesses clears the
ICS bit or unmasks the relevant interrupt, the GPV handler can, after emulating the instruction,
reset the MPL to its original value and then downcall to the delegated interrupt
routine.
Another method requires the cooperation of the delegated-to code, only works when the delegatee is in an interrupt critical section, but is somewhat easier to implement. With this method, the
delegator arranges for the delegatee to get a notification, via an interrupt, indicating that it
should make a special service request to the delegator. Since that notification comes via an interrupt, it will not be delivered until the delegatee exits the critical section. The delegatee’s interrupt
routine is then executed, and it makes the special service request to the higher PL. That PL performs the downcall, but does not modify the delegatee PL’s exception context; when the delegatee
interrupt routine exits, it returns to the code that was interrupted by the notification interrupt.
To enable this second method, the Tile architecture provides four software-triggerable interrupts,
each of which can be targeted at any PL by setting an associated MPL register. Each interrupt is
asserted when a corresponding special purpose register is written with a 1. These interrupts are
the Interrupt Control [0:3] interrupts. In order to have these interrupts fire, the corresponding
INTCTRL_X_STATUS registers should have their low bit set. When this bit is set it allows software
to control when an interrupt fires. Refer to the descriptions of the INTCTRL_X_STATUS registers
that follow.
Interrupt Control 0 Status Register (INTCTRL_0_STATUS)
This register is used to specify the interrupt control 0 interrupt.
Speed: Slow
Minimum Protection Level: INTCTRL_0
INTCTRL_0_STATUS (fields: INTCTRL_0_STATUS, Reserved)
Figure 2-28: INTCTRL_0_STATUS Register Diagram
Table 2-17. INTCTRL_0_STATUS Register Bit Descriptions
Bits    Name                Reset    Description
63:1    Reserved            0        Reserved
0       INTCTRL_0_STATUS    0        This field specifies the interrupt control 0 interrupt.
Interrupt Control 1 Status Register (INTCTRL_1_STATUS)
This register enables the interrupt control 1 interrupt.
Speed: Slow
Minimum Protection Level: INTCTRL_1
INTCTRL_1_STATUS (fields: INTCTRL_1_STATUS, Reserved)
Figure 2-29: INTCTRL_1_STATUS Register Diagram
Table 2-18. INTCTRL_1_STATUS Register Bit Descriptions
Bits    Name                Reset    Description
63:1    Reserved                     Reserved
0       INTCTRL_1_STATUS    0        This register enables the interrupt control 1 interrupt.
Interrupt Control 2 Status Register (INTCTRL_2_STATUS)
This register enables the interrupt control 2 interrupt.
Speed: Slow
Minimum Protection Level: INTCTRL_2
INTCTRL_2_STATUS (fields: INTCTRL_2_STATUS, Reserved)
Figure 2-30: INTCTRL_2_STATUS Register Diagram
Table 2-19. INTCTRL_2_STATUS Register Bit Descriptions
Bits    Name                Reset    Description
63:1    Reserved
0       INTCTRL_2_STATUS    0        This register enables the interrupt control 2 interrupt.
Interrupt Control 3 Status Register (INTCTRL_3_STATUS)
This register enables the interrupt control 3 interrupt.
Speed: Slow
Minimum Protection Level: INTCTRL_3
INTCTRL_3_STATUS (fields: INTCTRL_3_STATUS, Reserved)
Figure 2-31: INTCTRL_3_STATUS Register Diagram
Table 2-20. INTCTRL_3_STATUS Register Bit Descriptions
Bits    Name                Reset    Description
63:1    Reserved                     Reserved
0       INTCTRL_3_STATUS    0        This register enables the interrupt control 3 interrupt.
2.8 Software-Visible Dynamic Networks
2.8.1 Overview
The dynamic networks provide packet-based communication between Tiles, I/O devices, and
Memory. The Tile Architecture™ provides two dynamic networks for direct software access, the
User Dynamic Network (UDN) and the I/O Dynamic Network (IDN).
The UDN is typically used for application level communication while the IDN is typically used
for operating system, I/O, and hypervisor communications.
A specific implementation of the Tile Architecture may employ additional dynamic networks for
hardware-based communication between Tiles, I/Os, and/or memory. Cache coherency and
memory operations, for example, can use dedicated dynamic networks. The hardware usage of
implementation specific dynamic networks is not defined in this document.
2.8.1.1 Register Mapping and Interlock
The UDN and IDN are directly accessible by the Tile’s Arithmetic Logic Unit (ALU). The networks
are register mapped making them highly integrated with the program flow. This design provides
low latency and low overhead access for network reads and writes. For example:
• {add udn0, r5, r6}
// This will add the contents of r5 to r6 and send the result to the UDN.
• {add r5, r6, udn0}
// This will read a word from the UDN, add it to r6, and put the result in r5.
• foo = udn0_receive();
// This C intrinsic will dequeue a single word of data from the UDN and
// store it in variable foo.
Access to the UDN and IDN is fully-interlocked. This allows an application to read the network
port and go to sleep automatically until data arrives. This method provides a low power state
with zero latency wake up. Similarly, on a network send if the network is not able to consume the
packet word immediately, the processor automatically waits until buffer space is available, which
saves a considerable latency over a polling or interrupt-driven scheme.
2.8.1.2 Routing
The dynamic networks are two-dimensional meshes. The sending software prepends a route
header on each packet. The route header contains the X and Y location information (along the X
and Y axes of the tile) of the target Tile or I/O device. Figure 2-32 shows the location information
for each tile in a processor. A route decision, based on a comparison of the X and Y coordinates in
the packet’s route header to the X and Y coordinates of the Tile, is made at each switch point as
the packet travels from the source node to the destination node. Each Tile’s X and Y coordinates
are stored based on the Tile’s position in the mesh.
(Block diagram. Labels shown: a 6 x 6 mesh of Tiles marked with X,Y coordinates from 0,0 to 5,5;
64-Bit Processor, Register File, 3 Execution Pipelines, L1 ICache, L1 DCache, ITLB, DTLB, L2 Cache,
Terabit Switch; DDR3 Memory Controller (0) and (1); mPIPE, MiCA, TRIO; 4x GbE/10GbE ports with
SGMII/XAUI SerDes; PCIe 2.0 4-Lane, 4-Lane, and 8-Lane ports; JTAG, Flexible I/O, 4x I2C, SPI,
2x UART, 2x USB.)
Figure 2-32: Tile Processor Hardware Architecture
The routing algorithm is as follows:
The X dimension is checked first:
• If the packet’s destination X value is less than the SPR value, send the packet west (see footnote 1). For example, the tile west of tile 1,0 is tile 0,0.
• If the destination X value is greater than the SPR value, send the packet east.
• If the destination X value and the SPR value are equal, check Y.
Then the Y dimension is checked as follows:
• If the packet’s destination Y value is less than the SPR value, send the packet north.
• If the destination Y value is greater than the SPR value, send the packet south.
• If the destination Y value and the SPR value are equal, send the packet to the tile demux logic.
The routing order can be changed to route the Y dimension first, followed by X dimension, by
writing the MEM_ROUTE_ORDER SPR.
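The per-switch routing decision can be expressed as a short C function. This is a sketch for illustration: the enum, the function name, and the y_first flag (standing in for the alternate routing order) are hypothetical, and tile_x and tile_y stand for the coordinate values held in the Tile’s SPR.

```c
#include <stdbool.h>

enum route { ROUTE_WEST, ROUTE_EAST, ROUTE_NORTH, ROUTE_SOUTH, ROUTE_LOCAL };

/* One routing decision at a switch point.
   dest_x/dest_y come from the packet's route header;
   tile_x/tile_y model this tile's coordinates;
   y_first selects Y-then-X order instead of the default X-then-Y. */
static enum route route_step(unsigned dest_x, unsigned dest_y,
                             unsigned tile_x, unsigned tile_y, bool y_first)
{
    if (!y_first) {
        if (dest_x < tile_x) return ROUTE_WEST;
        if (dest_x > tile_x) return ROUTE_EAST;
    }
    if (dest_y < tile_y) return ROUTE_NORTH;
    if (dest_y > tile_y) return ROUTE_SOUTH;
    /* Reached when Y matches; in X-first order X already matches too. */
    if (dest_x < tile_x) return ROUTE_WEST;
    if (dest_x > tile_x) return ROUTE_EAST;
    return ROUTE_LOCAL;   /* deliver to the tile demux logic */
}
```

For a packet at tile 1,1 addressed to tile 3,2, the default order returns ROUTE_EAST, while Y-first order returns ROUTE_SOUTH.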
2.8.1.3 Demultiplexing
Each packet sent to a Tile via the IDN or UDN contains an ID that is used by hardware to demultiplex multiple flows at the receiver. The UDN provides four hardware demultiplexed flows and
the IDN provides two.
Individual demultiplexed flows may be accessed directly using named registers: udn0, udn1,
udn2, udn3, idn0, and idn1. Hardware removes the route header word at the receiver; consequently, software sees only the packet data at the named registers.
For more information about the UDNx registers, refer to the “Register Set”, in the TILE-Gx Instruction Set Architecture Specification (UG401).
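The tag comparison performed by the UDN demux can be pictured with the following sketch. It assumes that the packet ID is compared against four programmed tag values (UDN_ID_0 through UDN_ID_3 in the demux diagram) and that a packet matching none of them goes to a catch-all queue; the function name and the CATCH_ALL value are hypothetical.

```c
#include <stdint.h>

#define UDN_FLOWS 4
#define CATCH_ALL (-1)   /* hypothetical marker for the catch-all queue */

/* Returns the index of the demux queue whose tag matches the packet ID,
   or CATCH_ALL when no tag matches. */
static int demux_select(const uint32_t ids[UDN_FLOWS], uint32_t packet_id)
{
    for (int i = 0; i < UDN_FLOWS; i++) {
        if (ids[i] == packet_id)
            return i;
    }
    return CATCH_ALL;
}
```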
1. Note that Special Purpose Registers can be either read or write registers. If they are read only, the SPR has a value that is stored in
the register.
(Block diagram. Labels shown: UDN Network Interface with N, S, E, and W ports; Tag Compare against
UDN_ID_0 through UDN_ID_3; Demux Buffer; SU0 through SU3; Main Processor Interface.)
Figure 2-33: Demux
2.8.1.4 Receive-Side Buffering
To prevent head-of-line blocking, per-flow buffer space is provided at the receiving tile. This
allows packets to be dequeued in a different order than they were received at the switch point.
The depth of the buffer varies by implementation. By storing independent flows in separate
addressable queues, software can consume packet flows out of order relative to the arrival at the
switch without causing head of line blocking.
This undifferentiated buffer has programmable high-watermarks for UDN and IDN traffic. These
watermarks provide a hard partition of the buffer between UDN and IDN flows when deterministic, non-blocking performance is required on the IDN or UDN.
2.8.2 Ordering
Packets will be delivered in order between any two nodes for the same ID. Packets from different nodes
or using different IDs are not ordered with respect to each other. Packets are never interleaved at
the destination for a given flow.
2.8.2.1 Packet Format
Packets on the dynamic network use the following format:
Word 0: Header (bits 63:0; labeled fields: Dest_X, Dest_Y, Length; ID, present if F=1)
Word 1 onward: Data (1-128 words)
Figure 2-34: Dynamic Network Packet Format
Table 2-21. Field Descriptions
Word, Bits      Name        Description
Word0, 6:0      Length      Number of 32-bit words in the packet not including the route header. A value of 0 indicates a 128-word packet.
Word0, 17:7     Dest_Y      Y coordinate of destination Tile or I/O device.
Word0, 18:9     Dest_X      X coordinate of destination Tile or I/O device.
Word0, 63:30    Reserved    Must be zero.
Word1-128       Data        1 to 128 words of packet payload.
Hardware in the switch points uses the route header (word0) to route the packet from source to
destination.
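The Length field encoding, in which a value of 0 stands for 128 words, can be handled as in the sketch below. Only the Length field (bits 6:0 of word 0) is modeled; the function names are hypothetical.

```c
#include <stdint.h>

/* Decode the payload length from header word 0.
   Length occupies bits 6:0; an encoded 0 means a 128-word packet. */
static unsigned packet_length_words(uint64_t header)
{
    unsigned len = (unsigned)(header & 0x7Fu);
    return len == 0 ? 128u : len;
}

/* Encode a payload length of 1..128 words into the 7-bit field.
   128 does not fit in seven bits, so it maps to the encoded value 0. */
static unsigned encode_length(unsigned words)
{
    return words & 0x7Fu;
}
```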
2.8.3 Network Hardwall
The switches for the UDN and IDN provide an optional hardwall for independent communication
domains as well as virtualization. When an output port is protected via a bit in the U/IDN_DIRECTION_PROTECT SPR, no data will be sent out of the associated port. An interrupt allows
software to detect any attempt to send traffic to a protected port.
Software can handle a hardwall protection violation as follows:
1. Read the U/IDN_DIRECTION_PROTECT SPR to determine which output port(s) generated the
violation.
2. If multiple output ports are detecting a violation, choose one to process and use the output
port’s U/IDN_SP_STATE.OP_MUX_SEL indicator to determine the source port. Otherwise, if only
one output port is in violation, the source port indicator can be used to determine which input
port needs to be extracted.
3. Read the packet from the offending input port using the FIFO “spill” SPRs. Packet length
must be interpreted so that exactly the entire packet is eventually extracted (though it may
require multiple trips through the ISR).
4. If an entire packet is not available, return from the handler; a trap occurs again when
subsequent words arrive.
5. If an entire packet is sent, clear the locked indicator on the output port.
2.8.4 Interrupts
Interrupts are provided for the following conditions on the UDN and IDN:
• A data word is available on udn0/1/2/3, idn0/1 (individually maskable).
• Data is available on catch-all queue.
• Data is sent on the UDN/IDN ports. This interrupt is triggered after the last word has been sent.
• A switchpoint hardwall violation occurs (data word attempted to be sent to a protected output port).
2.8.5 Deadlocks
The dynamic networks provide deadlock-free routing between nodes for a single traffic flow in
each direction. But software can induce a deadlock at the protocol level if circular dependencies
are created between sending and receiving data.
Because dynamic network switch points are locked once a route header has traversed the switch
(packets are never interleaved), care must be taken to prevent deadlock due to a node sending a
partial packet.
2.9 Special Purpose Registers (SPRs)
The Tile Processor contains special purpose registers (SPRs) that are used for several purposes:
• Hold state information and provide locations for storing data that is not in the general purpose register file or memory
• Provide access to structures such as TLBs
• Control and monitor interrupts
• Configure and monitor hardware features, for example prefetching, iMesh routing options, etc.
SPRs can be read and written by tile software (via mfspr and mtspr instructions, respectively),
and in some cases are updated by hardware.
The SPRs are grouped by function into protection domains, each of which can be set to an access
protection level, called the minimum protection level (MPL) for that protection domain. “Protection” on page 29 defines how the MPLs are used. Software that attempts to access an SPR for
which it does not have the appropriate privilege level will cause a General Protection Violation
(GPV) interrupt, and information will be logged into the GPV_REASON SPR, as described in “General Protection Violation Reason Register (GPV_REASON) ” on page 25.
For a complete list and detailed descriptions of SPRs, see Table 8-1, “Special Purpose Registers,” on page 106.
2.10 Performance Counters / System Diagnostics
2.10.1 In-Tile System Devices
Each Tile provides a number of system services, diagnostics, and performance monitoring capabilities to truly support a complete system within the Tile.
2.10.1.1 Tile Timer and AUX_TILE_TIMER
Two 32-bit down counters with interrupts are provided in the Tile. The timers can be used for
operating system level “tick” functionality or for any other timing task.
The interrupt uses the TILE_TIMER minimum protection level (MPL). The timer is located in the
COUNT field of the TILE_TIMER_CONTROL SPR. This counter can be disabled and can be saved/
restored during context swapping. The UNDERFLOW bit of the TILE_TIMER_CONTROL SPR indicates that the counter has wrapped from 0 to (2^32)-1.
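The wrap behavior of the down counter can be illustrated with a small model. This sketch covers only the COUNT value and the UNDERFLOW indication; the structure and function names are hypothetical, and the disable control and interrupt delivery are not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a 32-bit down counter with an underflow flag. */
struct tile_timer {
    uint32_t count;      /* models the COUNT field */
    bool     underflow;  /* models the UNDERFLOW bit */
};

/* Advance the timer by one tick. */
static void timer_tick(struct tile_timer *t)
{
    if (t->count == 0)
        t->underflow = true;   /* about to wrap from 0 to 2^32 - 1 */
    t->count--;                /* unsigned arithmetic wraps modulo 2^32 */
}
```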
2.10.1.2 Cycle Counter
The Tile architecture™ provides a 64-bit free running cycle counter. The counter initializes to 0.
Read access to the counter is provided by the CYCLE SPR, which is part of the WORLD_ACCESS
MPL. Software can also write the counter via the CYCLE_MODIFY SPR, which is part of the
BOOT_ACCESS MPL.
2.10.2 Events
The performance monitoring and system debug capabilities of the Tile architecture rely on implementation-defined events. The specific set of events available to software varies depending on
implementation, but examples include cache-miss, instruction bundle retired, network word sent
and so on. Events are used to increment performance counters or interact with system debug/
diagnostics functionality.
2.10.3 Counters
The Tile architecture provides four 32-bit performance counters. The counters may be assigned to
any one of the implementation-specific events or to the other performance counter, providing 64-bit counters if needed.
On overflow, the counter triggers an interrupt at the PERF_COUNT MPL. Performance counters
are controlled and monitored via the PERF_COUNT_0/1, PERF_COUNT_CTL, PERF_COUNT_STS, AUX_PERF_COUNT_0/1, AUX_PERF_COUNT_CTL, and AUX_PERF_COUNT_STS
SPRs.
A current protection level (CPL) mask is provided in the PERF_COUNT_CTL SPR to prevent the
associated performance counter from running at specific protection levels.
2.10.4 Watch Registers
The Tile architecture provides programmable watch registers to track matches to implementation-specific multi-bit fields, for example, a match on a specific fetch PC or a specific speculative memory reference Virtual Address. The match comparison uses the 64-bit WATCH_MASK SPR to qualify
which bits participate in the comparison.
The result of the watch comparison is an event which can be selected by the performance counters. Hence an interrupt can be raised on the Nth occurrence of the watch match comparison by
preloading a performance counter with (2^32)-N and selecting the SPCL.WATCH event in the performance counter.
The watch register can be used to trigger a performance counter interrupt by preloading the
counter with ((2^32)-1)-N where N is the number of matches at which an interrupt is desired. Note
that the interrupt is not precise, hence it is not possible to trap the exact instruction that caused
the match. Precise trapping based on VA or PC is generally provided by software debugging services such as GDB which rely on TLB protection attributes and instruction emulation to set
breakpoints on specific VAs or PCs.
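The masked comparison described at the start of this section can be sketched as follows. The sketch assumes that a 1 bit in the mask means the corresponding bit participates in the comparison; the function name and the watch_value parameter are hypothetical stand-ins for the programmed match value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when observed and watch_value agree on every bit selected
   by watch_mask (assumption: 1 = bit participates in the comparison). */
static bool watch_match(uint64_t observed, uint64_t watch_value,
                        uint64_t watch_mask)
{
    return ((observed ^ watch_value) & watch_mask) == 0;
}
```

With a mask of 0xFF00, for example, only bits [15:8] of the observed PC or address are compared, so 0x1234 matches a watch value of 0x1200.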
2.10.5 Pass SPR
The Tile architecture provides a 64-bit PASS SPR that acts as a general purpose scratchpad. This
SPR is typically used for higher level signaling in Tile simulators. Two additional aliases to the
same PASS SPR are provided via the DONE and FAIL SPRs. On the TILE-Gx device, writes to the
PASS, FAIL, and DONE SPRs are specific events that can be fed into the performance counters or
the diagnostics functions.
2.10.6 Broadcast Networks
TILE-Gx provides four dedicated 1-bit broadcast networks intended for use with performance
monitoring and diagnostics. Each network consists of a single input wire and a single output wire
on each compass point. When the input wire asserts, the Tile asserts all four of the output wires in
the next cycle. Tiles assert their output wires when the event selected by the DIAG_BCST_CTL
SPR occurs. Additionally, software may assert a Tile’s broadcast network by writing to the
DIAG_BCST_TRIGGER SPR.
The assertion of a broadcast network is an event that may be fed into the performance counters for
a given Tile. Using this mechanism, up to four Tile events can be fed to other Tiles allowing six
effective performance counters for a single Tile. These counters also provide a low-latency interrupt mechanism between Tiles. Note that it is not recommended to use this mechanism in
implementation independent code, since the behavior is not defined by the Tile architecture.
The broadcast networks provide a hardwall on each output port in the DIAG_BCST_MASK SPR.
The hardwall can be used to prevent cycles in the broadcast networks and to partition the Tile
fabric into independent domains.
The broadcast networks are additionally visible and controllable in the Rshim (see footnote 1) via the
DIAG_BROADCAST on channel-0.
2.10.7 System Software Debug
Debugging of applications software typically uses industry standard methods such as the GNU
debugger (GDB). However, when debugging low level hypervisor or system software, it is sometimes necessary to extract processor state data without the assistance of cooperating debugger
software running on the tile.
2.10.7.1 Tile Debug Port
To aid with system software debugging, TILE-Gx provides access to essential processor state data
such as fetch-PC, registers, and specific SPRs. This Tile state is readable from any of the following
sources:
• JTAG
• USB
• UART (in protocol mode – see “Protocol Mode” on page 211)
• I2C Slave
• Software running on another Tile
• PCIe
Access to Tile state is via a JTAG instruction-like interface that can be addressed directly from
JTAG or via a JTAG control interface located in the Rshim (see Section 15.5 Rshim JTAG for more
information on the JTAG_CONTROL, JTAG_SETUP, and JTAG_DATA registers).
Read/Write Access
Reads and writes to the Tile debug state are managed by sending commands to the Tile via the
following JTAG instruction registers:
Table 2-22. JTAG Instruction Registers
Name         Number    Size (bits)    Description
EXT_MODE     0xF8      2              Must be written with 0x1 to enable Tile debug.
INST_SEL     0xF6      TWIDTH*8       Debug command register – one in each Tile, complete row of Tiles is concatenated. Format of this register is defined below. TWIDTH=69, TWIDTH=70
BLOCK_SEL    0xF4      17             One-hot Tile row select. Bits[16:8] are reserved and must be 0.
1. Refer to the “Glossary” on page 703 for a definition of Rshim.
The format of the INST_SEL command shifted into each Tile is as follows:
Bits             Field      Description
(TWIDTH-1):55    R          Must be 0.
54:53            CMD        0: NOP, 1: Read, 2: Write
52:48            OBJ_SEL    Object within the tile being accessed.
47:32            OFFSET     Offset within the structure being accessed.
31:0             DATA       Shifted in for writes, shifted out for reads.
This command is shifted in to the West-most Tile in the row selected by BLOCK_SEL and then
through the rest of the Tiles in the row and finally out of the East-most Tile and back to the JTAG
controller. Each Tile has its own 69-bit command register so 69*8 bits must be shifted in with the
non-accessed Tiles’ CMD fields set to 0.
A write to the debug port on a given Tile would be handled as follows:
1. Write to the EXT_MODE register to set it to 1.
2. Set the BLOCK_SEL register to indicate which Tile row is to be accessed (for example, a Tile in the
3rd row down would use BLOCK_SEL = 0x0_0008).
3. Shift in 69*8 bits to the INST_SEL register. The set of 69 bits corresponding to the Tile to be
written would have CMD=2 and the OBJ_SEL/OFFSET/DATA fields set appropriately. All other
Tile commands would be 0.
Reads require two steps. The read command is shifted exactly as a write except CMD=1. Then a
second shift of 69*8 bits is used to extract the data.
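The packing of the per-Tile command and of the concatenated row can be sketched as follows. This is a minimal illustration, not a driver: the function names are hypothetical, TWIDTH is taken as 69 as in the text above, and the mapping of Tile column to position in the shifted value (column 0 in the least significant bits) is an assumption that must be checked against the actual scan-chain order.

```python
TWIDTH = 69          # per-Tile command register width used in the text
TILES_PER_ROW = 8    # number of Tile registers concatenated in one row

CMD_NOP, CMD_READ, CMD_WRITE = 0, 1, 2


def pack_tile_cmd(cmd, obj_sel=0, offset=0, data=0):
    """Pack one per-Tile command; the R bits at (TWIDTH-1):55 stay 0."""
    assert 0 <= cmd < 4 and 0 <= obj_sel < 32
    assert 0 <= offset < (1 << 16) and 0 <= data < (1 << 32)
    # CMD at [54:53], OBJ_SEL at [52:48], OFFSET at [47:32], DATA at [31:0]
    return (cmd << 53) | (obj_sel << 48) | (offset << 32) | data


def build_row_shift(target_col, cmd, obj_sel, offset, data=0):
    """Concatenate TILES_PER_ROW commands; non-accessed Tiles get all zeros (NOP)."""
    row = 0
    for col in range(TILES_PER_ROW):
        word = pack_tile_cmd(cmd, obj_sel, offset, data) if col == target_col else 0
        # Assumption: column 0 occupies the least significant TWIDTH bits.
        row |= word << (col * TWIDTH)
    return row


def block_sel(row_bit):
    """One-hot row select; only bits [7:0] are usable, bits [16:8] are reserved."""
    assert 0 <= row_bit < 8
    return 1 << row_bit
```

With these helpers, `block_sel(3)` yields 0x8, the BLOCK_SEL value used in the example in step 2, and a read would be issued by shifting `build_row_shift(col, CMD_READ, ...)` followed by a second shift of the same length to collect the DATA field.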
Objects
The objects listed in Table 2-23 are defined to provide Tile debug state.
Note: The diagnostic access path is 16 bits wide. The LSBs of the JTAG address select a 16-bit slice of wider structures. For example, for a 64-bit structure, JTAG[1:0] selects bits [15:0] when 0, bits [31:16] when 1, and so on.
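The slicing described in the note can be illustrated with a short sketch. The helper names are hypothetical; the only behavior taken from the text is that slice index 0 corresponds to bits [15:0], index 1 to bits [31:16], and so on.

```python
def slice16(value, slice_index):
    """Return the 16-bit slice of value selected by slice_index (0 = bits [15:0])."""
    return (value >> (16 * slice_index)) & 0xFFFF


def join16(slices):
    """Reassemble a wide value from its 16-bit slices, lowest slice first."""
    value = 0
    for index, part in enumerate(slices):
        value |= (part & 0xFFFF) << (16 * index)
    return value
```

Reading a 64-bit structure therefore takes four accesses, one per slice, whose results can be recombined with `join16`.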
Table 2-23. Tile Debug Objects
Name | OBJ_SEL (in HEX) | Data Width | Description
L1D data | 0x09 | 144
1. JTAG address [14] selects way1/way0 [1: way1, 0: way0].
2. JTAG address [13:12] selects a ram, a 16-byte chunk of a cacheline.
3. JTAG address [11:4] is [L1_IDX_MSB:L1_IDX_LSB].
4. Each index in a ram has 144 bits, for a total of 144x4=576 bits, including 512 bits of data and 64 bits of parity.
5. ram0 has byte-15 to byte-0, with layout [byte-15 parity, byte-15 data, byte-7 parity, byte-7 data, byte-14 parity, byte-14 data, byte-6 parity, byte-6 data, ..., byte-0 data].
6. ram1 has byte-31 to byte-16, with layout [byte-31 parity, byte-31 data, byte-23 parity, byte-23 data, byte-30 parity, byte-30 data, byte-22 parity, byte-22 data, ..., byte-16 data].
7. ram2 has byte-47 to byte-32.
8. ram3 has byte-63 to byte-48.
For more information refer to Section 2.4.4 Cache Micro Architecture in the Tile Processor Architecture Overview for the TILE-Gx Series (UG130).
L1D tag | 0x0a | 54
1. JTAG address [9:2] is [L1_IDX_MSB:L1_IDX_LSB].
2. JTAG address [1:0] is indexing 54-bit input/output.
3. Each index in the ram has 54 bits, comprising 2 ways with a 26-bit tag and 1 parity bit per way.
4. The physical layout of the 54 bits is [way1 parity, way1 tag, way0 parity, way0 tag].
For more information refer to Section 2.4.4 Cache Micro Architecture in the Tile Processor Architecture Overview for the TILE-Gx Series (UG130).
L2 data | 0x02 | 266
1. JTAG address [16:15] selects the ram [3: way7/6, 2: way5/4, 1: way3/2, 0: way1/0].
2. JTAG address [14] selects the odd/even way [1: odd, 0: even].
3. JTAG address [13:5] is [L2_IDX_MSB:L2_IDX_LSB].
4. The object includes the bottom half of the L2 data in 4 rams.
5. Each index in a ram has 266 bits, with [265:256] the ecc code and [255:0] the bottom half of the cacheline data.
For more information refer to Section 2.4.4.2 L2 Cache Subsystem in the Tile Processor Architecture Overview for the TILE-Gx Series (UG130).
L2 data | 0x01 | 266
1. JTAG address [16:15] selects the ram [3: way7/6, 2: way5/4, 1: way3/2, 0: way1/0].
2. JTAG address [14] selects the odd/even way [1: odd, 0: even].
3. JTAG address [13:5] is [L2_IDX_MSB:L2_IDX_LSB].
4. The object includes the upper half of the L2 data in 4 rams.
5. Each index in a ram has 266 bits, with [265:256] the ecc code and [255:0] the upper half of the cacheline data.
For more information refer to Section 1.6.2 RegisterFile (RF) in the Instruction Set Architecture for TILE-Gx (UG401).
L2 tag | 0x03 | 272
1. JTAG address [13:5] is [L2_IDX_MSB:L2_IDX_LSB].
2. The output of the two rams totals 272 bits, including a 1-bit parity, an 8-bit id, and a 25-bit tag for each of the 8 ways.
3. 272 bits layout: 8-way interleaving [parity, id, tag].
4. Parity does not cover the 8-bit id.
For more information refer to Section 2.4.4.2 L2 Cache Subsystem in the Tile Processor Architecture Overview for the TILE-Gx Series (UG130).
L2 lva | 0x05 | 64
L2 State.
1. JTAG address [10:2] is [L2_IDX_MSB:L2_IDX_LSB].
2. L2_LVA is a two-port ram, with 1 read port and 1 write port.
3. Output of the ram is 64-bit.
4. The 64-bit layout: 8-bit LRU at [63:56], followed by 8-way interleaving [parity, 1-bit touch, 3-bit share, 1-bit dirty, 1-bit valid].
L2 directory | 0x07 | 72
L2 Share Tracker.
1. JTAG address [13:12] selects the way-pair [7/6, 5/4, 3/2, 1/0].
2. JTAG address [11:3] is [L2_IDX_MSB:L2_IDX_LSB].
3. The output of the two rams totals 72 bits [odd way 36 bits, even way 36 bits].
L2 SDN | 0x04 | 256
SDN Ingress Buffer.
1. JTAG address [8:4] is indexing both 32-entry rams.
2. The output of the two rams totals 256 bits.
L2 RTF | 0x06 | 156
L2 Retry FIFO.
1. JTAG address [8:4] is indexing a 32-entry ram.
2. The output of the two rams totals 156 bits.
L2 MAF (note a) | 0x08 | 91
L2 Missing Address File.
1. JTAG address [5:3] is indexing an 8-entry MAF.
2. JTAG address [2:0] is indexing a 7-slice 15-bit data [14:0].
3. The top bit of each slice [15] is the “not-success” signal.
DMUX | 0x0f | 74
UDN and IDN Ingress Buffer.
1. JTAG address [10:3] is indexing both 256-entry rams.
2. The output of the two rams totals 74 bits.
For more information refer to “Demultiplexing” on page 46.
REGFILE (note a) | 0x11 | 64
1. JTAG address [7:2] is indexing the 64-entry REGFILE.
For more information refer to Section 1.6.2 RegisterFile (RF) in the Instruction Set Architecture for TILE-Gx (UG401).
QUIESCE | 0x12 |
1. set_quiesce by writing 1, clear_quiesce by writing 0; [1] = cbox, [0] = sbox.
For more information refer to “Quiesce” on page 56.
SPR | 0x10 | 64
1. JTAG address [1:0]: selects the chunk of the holding register.
2. JTAG address [15:2]: SPR index (when access-bit = 1).
3. JTAG address [16]: access-bit.
= 1 to write/read the SPR;
= 0 to write/read the holding register.
For more information refer to Section 1.5 Special Purpose Registers in the Instruction Set Architecture for TILE-Gx (UG401) and Section 2.3.1 Special Purpose Registers (SPRs) in the Tile Processor Architecture Overview for the TILE-Gx Series (UG130).
L1I data (note b) | 0x0b, 0x0c | 65, 65
1. JTAG address[13] is the MSB of the ram address; it selects the line from way 0 or 1 within the addressed set.
2. JTAG address[12:5] is the LSBs of the ram address, indexing the set (1 of 256).
3. JTAG address[4:3] selects the instruction 0 - 7 within the line.
4. JTAG address[2:0] selects the 16-bit slice from the instruction: 0->15:0, 1->31:16, etc.; 4-> {15'b0, odd parity}; 5, 6, 7 unused.
For more information refer to Section 2.4.2 Front End Micro Architecture in the Tile Processor Architecture Overview for the TILE-Gx Series (UG130).
L1I tag (note b) | 0x0e, 0x0d | 74, 74
1. JTAG address[9:3] is the set.
2. JTAG address[2:0] selects the 16-bit slice from the tags.
[73:37] - way 1, [36:0] - way 0.
Within each 37 bits: [36:28] - way predict, [27] - valid, [26] - odd parity, [25:0] - physical address[39:14].
For more information refer to Section 2.4.2 Front End Micro Architecture in the Tile Processor Architecture Overview for the TILE-Gx Series (UG130).
a. Read only access.
b. The Icache is physically and logically split as EVEN and ODD. For instructions within a given cacheline, instructions 0, 2, 4, 6 are in the EVEN DAT object and instructions 1, 3, 5, 7 are in the ODD DAT object (that is, bit 3 of the instruction address determines which DAT object to use). The same applies to tags; however, bit 6 of the instruction address determines which TAG object to use.
When accessing tile objects via JTAG, it is important that the tile is quiesced. This prevents
resource conflicts. Also, when extracting data from the trace buffer, software should first clear the
trace-buffer-enable via a JTAG write to the tile’s trace buffer control SPR. This prevents a trace
buffer write from being corrupted by a JTAG access to the trace buffer.
2.10.7.2 Quiesce
In order to read Tile debug state via the debug port, the Tile must first be quiesced. This process
can be handled via the broadcast networks described above or by writing the QUIESCE bit via the
debug port. If the latter technique is used, software must first assign the DIAG_BCST_CTL.QUIESCE_SEL SPR to at least one of the broadcast networks to enable the Tile’s quiesce capability.
This is typically done by boot software to allow debugging anytime later.
A Tile that has been quiesced will cease instruction fetch operations. Dynamic network traffic will continue to pass through the Tile, and multi-Tile cache coherency operations will continue to be processed, so other Tiles can continue to operate and communicate normally.
2.11 Boot Processes and Data Format
2.11.1 Boot Flow
The Tile Processor™ is booted by pushing boot data to the Tiles over the UDN using the Rshim’s
packet generator. Thus, any device that can access the Rshim can boot the chip. A level-0 boot program is built into the hardware and runs immediately following hardware reset. This program
interprets the incoming boot stream according to the format shown in Table 2-24.
Boot-capable interfaces such as USB, StreamIO, or PCIe provide a hardware means for communicating with the Rshim for booting. Since all bootable devices communicate with the Rshim for
booting, the flow is essentially the same regardless of the source device. Boot data can also be
“pulled” into the chip via the SPI or I2C-Master interfaces. This is controlled by the RSHIM_BOOT_CONTROL register. This register is initialized based on the BOOT_MODE strapping pins but
can be overridden by software. It is NOT reset when the chip is software-reset, thus it is possible
to change the boot mode and reset the chip. This is useful for operations like POST that might
require different boot images and/or sources depending on POST results.
For more information about the BOOT_MODE strapping pins, SROM_SPI_SCK and I2CMx_SCL,
refer to the appropriate TILE-Gx data sheet for your processor.
Table 2-24. L0 Boot Format
Field | Description
WORD-0 | Number of words in the block, not counting the first two and last words.
WORD-1 | 64-bit address where the block should be stored.
WORD-2 through WORD-N+1 | 'N' 64-bit words of L1 boot code.
WORD-N+2 | Address to which the L0 boot code should jump. If 0, then another block will be read; otherwise, the address at which to begin execution of the L1 code.
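As an illustration of the block layout in Table 2-24, the sketch below serializes one boot block. The function name is hypothetical, and two details are assumptions not stated in the table: that WORD-0 is also a 64-bit word, and that words are serialized in little-endian byte order.

```python
import struct


def l0_boot_block(load_addr, code_words, jump_addr=0):
    """Build one L0 boot block: count, load address, N code words, jump address.

    jump_addr == 0 means another block follows; a nonzero value is the
    address at which execution of the L1 code begins.
    """
    words = [len(code_words), load_addr, *code_words, jump_addr]
    # Assumption: every word is 64 bits wide and little-endian.
    return struct.pack("<%dQ" % len(words), *words)
```

A multi-block boot stream would then be the concatenation of several such blocks, with only the final block carrying a nonzero jump address.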
2.11.2 Chip Modes and Reset Behavior
There are two special processor modes that are especially relevant to the booter: Physical memory
mode and Cache-as-RAM mode.
Physical memory mode allows the processor to do memory references without first needing to
program virtual-to-physical address translations into the TLB. In this mode, the 40-bit physical
address (PA) used for each load, store, or instruction fetch is constructed by using the low 40 bits
of the 42-bit virtual address (VA) used by the processor. The processor is in physical memory
mode when it emerges from reset, and all of the boot code executes in this mode; the TLB is not
enabled until just before the hypervisor begins execution.
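The address construction in physical memory mode amounts to a mask of the virtual address. A minimal sketch (the function name is hypothetical):

```python
PA_BITS = 40  # physical address width
VA_BITS = 42  # virtual address width


def physical_mode_pa(va):
    """In physical memory mode the PA is the low 40 bits of the 42-bit VA."""
    assert 0 <= va < (1 << VA_BITS)
    return va & ((1 << PA_BITS) - 1)
```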
Cache-as-RAM mode allows the processor to execute code and perform load and store instructions before the memory controllers have been configured, without using any external memory.
This is accomplished by a very small change to the behavior of the level two cache (the L2$).
While in cache-as-RAM mode:
• A read to an address that is valid within the cache (a read hit) is handled normally: the data is returned to the processor.
• A read to an address that is not valid within the cache (a read miss) is also handled normally: a request is sent to a memory controller, asking for the data. Typically, when running in cache-as-RAM mode, the memory controllers (and the tile SPRs that tell the tile which controller to use for which memory region) are not configured, so such a request will never complete, causing the chip to hang. Thus, the booter never makes a memory reference that could cause a read miss.
• A write to an address that is valid within the cache (a write hit) is handled normally: the data in the cache is modified.
• A write to an address that is not valid within the cache (a write miss) is handled specially. Normally, such a request would result in an eviction (where the data currently in the relevant cache line, if modified, would be written to memory); a background fill (where the data in the target memory cache line would be read into the cache); and finally, a modification of the data in the cache. In cache-as-RAM mode, the first two steps are skipped. Instead, a target cacheline is picked (a field in an SPR determines which way in the cache is used for this), the data is written to that line, and its tags are set to match the newly written address. Note that this modification does not set the dirty bit on the cache line; in fact, any write to a cache line while in cache-as-RAM mode clears its dirty bit. This prevents the line (which might have a completely false physical address in its tags) from ever being flushed to memory on eviction.
Like physical memory mode, the processor is in cache-as-RAM mode when it emerges from reset,
and all of the boot code executes in this mode. Note that the use of an SPR to designate the way
targeted on write miss means that each byte of the cache must be written to before it can be used
as memory.
About the Software Stack
For information about the software stack, refer to Tilera Hypervisor Theory of Operation, which is
available in the $TILERA_ROOT/doc/html directory.
2.11.3 Boot FIFO
For a description of how the host interface is used during the boot sequence, refer to “Rshim Host
Interface” on page 234.
For a description of how software implements flow control for boot transactions, refer to “Boot
and Rshim Regions” on page 81.
CHAPTER 3
DOUBLE DATA RATE SDRAM (DDR3) INTERFACE
3.1 Overview
The TILE-Gx36™ supports two independent DDR3 memory channels. There is one memory controller for each memory channel. Each memory controller can be operated independently. The
memory controller supports up to 800 MHz memory clock and 1600 MT/s data rate (PC3-12800).
The memory controller supports 64 bits of data plus an optional 8 bits of ECC. The memory
controller supports up to 16 ranks, and supports x4, x8, and x16 devices.
Any tile can communicate with any memory controller through the on-chip mesh network. The
reQuest Dynamic Network (QDN) is used for handling memory and MMIO requests. The
Response Dynamic Network (RDN) is used for memory responses, MMIO responses, as well as
interrupts.
The memory controller uses a CAM-based scheduler to improve memory bandwidth and latency.
A hardware memory striping mode is supported to distribute memory loads across multiple
memory controllers. DRAM address mapping is configurable. The priority level for memory
requests is also configurable.
The memory PHY (MPHY) layer handles all the physical aspects of operations, such as DRAM
interface timing and MPHY interface bring-up.
Figure 3-1: Memory Controller Block Diagram
[Block diagram. Labeled elements: Core Clock Domain and DRAM Clock Domain; Mesh Network Interface; QDN0, QDN1, RDN0, and RDN1 retime FIFOs; Ingress Path with Ingress Control, Read/Write Response FIFO, ECC, and MMIO; Egress Path with Mreq, CMD Buffers, DATA Buffer, Scheduler, and Protocol Controller; Memory PHY.]
3.2 Interfaces
The memory controller is attached to the external DRAM through the memory PHY interface. The
memory controller is attached to the TILEs through the network interface.
3.2.1 DDR3 Interface
Major characteristics of external memory are shown in Table 3-1.
Table 3-1. DDR3 Interface Characteristics
Feature | Description
Memory clock frequency | Up to 800 MHz
Bit rate | Up to 1600 MT/s
Data width | 64 bits, plus optional 8 bits ECC
Ranks supported | Up to 16 ranks
Banks supported | Up to 128 banks, optimized for 32 banks
ECC support | Single-bit correction, double-bit detection
DRAM parameters | DDR3 parameters are fully programmable
Voltage | 1.5V or 1.25V
3.2.2 Network Interface
The memory controller has two QDN network connections (QDN0 and QDN1) and two RDN network connections (RDN0 and RDN1). The networks run at the core clock frequency. QDN0 and RDN0 are physically connected to one TILE, and QDN1 and RDN1 are physically connected to another TILE.
Any tile can send a memory request to the memory controller through one of the two QDN network connections. The selection of the QDN network is static and is controlled by a configuration register in the TILE. The memory controller always returns a response/ACK on the same port from which the request came: a request that arrives on the QDN0 network is answered on the RDN0 network, and a request that arrives on the QDN1 network is answered on the RDN1 network.
The memory controller supports memory mapped I/O (MMIO). A TILE sends MMIO packets to
access the configuration registers in the memory controllers for configuration, status, etc. The
MMIO requests must come from the QDN0 network. Any MMIO requests coming from QDN1
network are considered errors, and the error status will be logged.
After memory requests are received from the network, they are stored in the
command buffer (CMD buffer), and the write data associated with write requests is stored in the
data buffer (DATA buffer). The memory controller performs link-level flow control over the
mesh network connections.
3.3 Data Flows
Terminology
Egress refers to the direction from a TILE to the external memory; ingress refers to the
direction from the external memory to a TILE.
3.3.1 QDN Memory Read Request Flow
A QDN memory read request sent by a tile arrives at a QDN port of the memory controller. A
retime FIFO is used to handle the clock domain crossing between the core clock domain and the
memory clock domain. The DRAM clock frequency is totally decoupled from the core clock
domain. An arbiter is used to handle the arbitration between the two QDN ports. Once a QDN
memory read request is selected by the arbiter, the QDN memory read request will be queued in
the CMD Buffer. To optimize for memory bandwidth and latency, the memory controller uses a
scheduler to service the memory requests based on external memory status, memory request attributes, and scheduling parameters. The QDN memory read request is converted to external
memory commands (such as activate, precharge, read) before it is sent to the memory PHY.
3.3.2 RDN Memory Read Response Flow
The memory controller constructs the header portion of an RDN read response from the read
response FIFO. The memory controller assembles the data portion of an RDN read response
packet from the read data returned from the external memory. After the retime FIFO, the RDN
read response packet is converted back into the core clock domain. To improve the mesh network
utilization, it can be configured so that no bubble cycle will be inserted / wasted in the middle of
one RDN read response packet.
Each QDN memory read request has a SEND_COPY attribute. Normally, the RDN memory read
response packet is sent to the home tile through the network connection. If the SEND_COPY attribute is asserted, then the same read response packet is also sent to the original requesting tile
through the other network connection. The QDN memory read request packet contains information on the location of the original requesting tile.
3.3.3 QDN Memory Write Request Flow
The QDN memory write request flow is similar to the QDN memory read request flow. The difference is that write request headers are stored in the CMD Buffer, and write
request data is stored in the DATA Buffer. The write request header and the write request data
differ in format and size. The memory controller dispatches the DDR3 write command and
the DDR3 write data at different times.
3.3.4 RDN Memory Write Response Flow
The RDN memory write response flow is similar to the memory read response flow. The difference is that the write response can be sent to an RDN port without waiting for the memory
operation to be finished in the external memory. This is referred to as an early write ack.
Each memory write request has a NO_ACK attribute bit. When the NO_ACK attribute is asserted, no
write response is dispatched. This is referred to as a no write ack.
3.3.5 Non-Cacheline Write Flow and Masked Write Flow
DDR3 SDRAM is burst-oriented, with the burst length being programmed to eight by the memory controller (see footnote 1). A burst of eight transfers 64 bytes of data over the 64-bit data bus of the physical interface. The cacheline size on the TILE-Gx processors is also 64 bytes, so one cacheline access maps to one burst-of-eight DDR3 access.
If a QDN memory write request is not a full cacheline write, then the memory background data is fetched first so that the integrity of the background data can be maintained and the optional ECC can be calculated. The memory controller writes back to the external memory after merging the background data (from the fetch) with the QDN memory write data (from the DATA Buffer). In order to improve memory bus utilization, other memory requests can be scheduled between the background fetch and the merged data write-back.
A masked write is a QDN memory write request where some of the bytes are masked (that is, not to be written). The memory controller treats masked writes the same as non-cacheline writes.
1. Note that the burst length can be specified either as four or eight, but it is only programmed to be a burst length of eight.
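The merge step for a non-cacheline or masked write can be modeled in a few lines. This is a behavioral sketch only: the function name and the byte-enable encoding (bit i set means byte i of the cacheline is written) are assumptions made for illustration, not the packet format used by the hardware.

```python
CACHELINE_BYTES = 64


def merge_masked_write(background, write_data, byte_enable):
    """Merge write data into the fetched background cacheline.

    byte_enable is a 64-bit mask; bit i set means byte i comes from
    write_data, otherwise the background byte is preserved.
    """
    assert len(background) == len(write_data) == CACHELINE_BYTES
    return bytes(
        new if (byte_enable >> i) & 1 else old
        for i, (old, new) in enumerate(zip(background, write_data))
    )
```

The merged line is what would be written back to DRAM, and it is the full merged line over which the optional ECC would be computed.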
3.4 Ordering
3.4.1 Out of Order Dispatch
Between a given source (that is, a tile or an I/O) and a destination (that is, a memory controller)
pair, QDN memory requests are always routed in order across the mesh network. However, the
memory controller can dispatch memory requests out of order to improve bandwidth and latency.
The only exception to this rule is when a physical address conflict is detected between any two
memory requests. An address conflict occurs when two memory requests overlap (partially or
entirely), regardless of whether the requests come from the same tile or I/O. These two memory
requests are dispatched in the order in which they are received by the memory controller, though not necessarily back to back: other non-conflicting requests can be dispatched in between.
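The conflict condition is a standard interval-overlap test. A minimal sketch, assuming each request is described by a start address and a length in bytes (the function name and this representation are illustrative):

```python
def requests_conflict(addr_a, len_a, addr_b, len_b):
    """Return True if byte ranges [addr, addr + len) overlap partially or entirely."""
    return addr_a < addr_b + len_b and addr_b < addr_a + len_a
```

Two requests for which this returns True must keep their arrival order; any other pair may be reordered by the scheduler.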
3.4.2 Out of Order Response
Between a source (that is a memory controller) and a destination (that is a tile or an I/O), RDN
responses are always routed in order. However, the memory controller might choose to return the
RDN responses out of order from the order that the corresponding QDN memory requests are
received. The destination (that is a tile or an I/O) uses a unique tag inside the RDN response packets to differentiate outstanding responses. The memory controller copies the tag to the RDN
response packet from the tag in the QDN memory request.
3.5 Addressing
A QDN request packet provides a 40-bit physical address. The upper bit(s) are the controller
selection bits, which are used to select the memory controller. For example, in the TILE-Gx36
implementation, bit 39 of the address bus is used to select the memory controller. The lower
address bits, bit 38 to bit 0, are used by each memory controller, so each memory controller can
support up to 512 GB of address space, as shown in Figure 3-2.
Figure 3-2: PA Address Mapping
[Diagram of TILE PA bits 39 through 0: the upper bit(s) are the Controller Selection Bit(s); the remaining lower bits are the Memory Controller PA Bits.]
3.5.1 Memory Controller Striping
Each memory controller has visibility to the lower address bits (bit 38 to bit 0). These physical
address (PA) bits are identical to the PA bits sent from TILEs. The controller selection bit(s) can be
configured as a hash function of other PA bits so that the memory requests are striped across multiple memory controllers.
PA_hashed[39] = PA[39] ^ (((PA[25]&E[2]) ^ PA[18] ^ PA[15] ^ (PA[14]&E[1]) ^ (PA[9]&E[0])) & M[1])
PA_hashed[38] = PA[38] ^ (((PA[24]&E[2]) ^ PA[17] ^ PA[16] ^ (PA[13]&E[1]) ^ (PA[10]&E[0])) & M[0])
Where
E[2:0] are configuration bits for hash function enable. E[2:0] are defined by the
MEM_STRIPE_CONFIG[10:8] register.
M[1:0] defines the striping modes. M[1:0] come from the selected two bits of MEM_STRIPE_CONFIG[7:0] indexed by PA[39:38].
Table 3-2. Striping Mode
M[1:0] | Description
00 | No load balancing
01 | Load balancing between memory controller pair (0,1) and/or pair (2,3)
10 | Load balancing between memory controller pair (0,2) and/or pair (1,3) (note a)
11 | Load balancing between all controllers (0,1,2,3) (note a)
a. Applies to a TILE-Gx processor with four memory controllers.
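The striping hash can be written out directly from the two equations above. In this sketch the function names are hypothetical, and the way M[1:0] is extracted is an interpretation of "the selected two bits of MEM_STRIPE_CONFIG[7:0] indexed by PA[39:38]": it assumes that index k selects bits [2k+1:2k]. That bit ordering should be confirmed against the register description.

```python
def bit(value, n):
    """Return bit n of value as 0 or 1."""
    return (value >> n) & 1


def stripe_hash(pa, mem_stripe_config):
    """Apply the controller-selection hash to a 40-bit physical address."""
    e = (mem_stripe_config >> 8) & 0x7          # E[2:0] = MEM_STRIPE_CONFIG[10:8]
    index = (pa >> 38) & 0x3                    # PA[39:38]
    # Assumption: index k selects MEM_STRIPE_CONFIG[2k+1:2k] as M[1:0].
    m = (mem_stripe_config >> (2 * index)) & 0x3

    h39 = ((bit(pa, 25) & bit(e, 2)) ^ bit(pa, 18) ^ bit(pa, 15)
           ^ (bit(pa, 14) & bit(e, 1)) ^ (bit(pa, 9) & bit(e, 0))) & bit(m, 1)
    h38 = ((bit(pa, 24) & bit(e, 2)) ^ bit(pa, 17) ^ bit(pa, 16)
           ^ (bit(pa, 13) & bit(e, 1)) ^ (bit(pa, 10) & bit(e, 0))) & bit(m, 0)

    # XOR the hash terms into PA[39] and PA[38]; all other bits are unchanged.
    return pa ^ (h39 << 39) ^ (h38 << 38)
```

With M = 00 the address passes through unchanged, which corresponds to the "No load balancing" row of Table 3-2.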
3.5.2 DDR Address Mapping (from Memory Address Mapping)
The physical address bits are mapped to the external memory address, as shown in Figure 3-3.
The LSB field has three bits, because the physical interface has 64 bits (eight bytes). The width of
the column field and the width of row field depend on the external DRAM components. The
bank field has three bits, because each rank is comprised of eight banks. The rank field has four
bits, because each memory controller supports up to 16 external DRAM ranks.
Figure 3-3: DDR Address Mapping
[Diagram of PA bits 38 through 0 mapped to the DDR address fields, from most significant to least significant: rank, row, bank, column, LSB.]
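The field decomposition can be sketched as follows. The field order (rank, row, bank, column, LSB from most to least significant) and the 3-bit LSB, 3-bit bank, and 4-bit rank widths come from the text; the function name is hypothetical, and the column and row widths are parameters because they depend on the DRAM components. The example widths used below (10 column bits, 16 row bits) are illustrative only.

```python
def decode_ddr_address(pa, col_bits, row_bits, bank_bits=3, rank_bits=4, lsb_bits=3):
    """Split the memory controller PA into DDR address fields (before any hashing)."""
    fields = {}
    shift = 0
    # Walk the fields from least significant to most significant.
    for name, width in (("lsb", lsb_bits), ("column", col_bits),
                        ("bank", bank_bits), ("row", row_bits),
                        ("rank", rank_bits)):
        fields[name] = (pa >> shift) & ((1 << width) - 1)
        shift += width
    return fields
```

For example, `decode_ddr_address(pa, col_bits=10, row_bits=16)` places the bank field at PA[15:13] and the rank field at PA[35:32].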
3.5.3 Memory Rank/Bank Hashing
The row field and the column field are identical to the corresponding PA bits. The rank and the
bank fields can be configured as a hash function of other PA bits, as the following examples show.
• Rank[1] = PA_rank[1] ^ ((PA[11] ^ PA[12] ^ PA[25]) & CFG_ADDR_HASH[4])
• Rank[0] = PA_rank[0] ^ ((PA[10] ^ PA[16] ^ PA[24]) & CFG_ADDR_HASH[3])
• Bank[2] = PA_bank[2] ^ ((PA[9] ^ PA[6] ^ PA[22] ^ PA[23]) & CFG_ADDR_HASH[2])
• Bank[1] = PA_bank[1] ^ ((PA[8] ^ PA[18] ^ PA[21] ^ PA[26]) & CFG_ADDR_HASH[1])
• Bank[0] = PA_bank[0] ^ ((PA[7] ^ PA[19] ^ PA[20] ^ PA[27]) & CFG_ADDR_HASH[0])
where the bit locations of PA_rank and PA_bank are determined by the widths of the row field
and the column field.
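The example hash equations above translate into a small table-driven routine. The function name and the table structure are illustrative; `pa_rank` and `pa_bank` are assumed to have been extracted from the PA beforehand, since their bit positions depend on the row and column widths.

```python
def bit(value, n):
    """Return bit n of value as 0 or 1."""
    return (value >> n) & 1


# (output bit, PA bits XORed together, CFG_ADDR_HASH enable bit)
RANK_TERMS = [(1, (11, 12, 25), 4), (0, (10, 16, 24), 3)]
BANK_TERMS = [(2, (9, 6, 22, 23), 2), (1, (8, 18, 21, 26), 1), (0, (7, 19, 20, 27), 0)]


def _apply(field, pa, cfg_addr_hash, terms):
    for out_bit, pa_bits, enable_bit in terms:
        parity = 0
        for n in pa_bits:
            parity ^= bit(pa, n)
        # The hash term only takes effect when its enable bit is set.
        field ^= (parity & bit(cfg_addr_hash, enable_bit)) << out_bit
    return field


def hash_rank_bank(pa, pa_rank, pa_bank, cfg_addr_hash):
    """Return (rank, bank) after applying the configurable hash."""
    return (_apply(pa_rank, pa, cfg_addr_hash, RANK_TERMS),
            _apply(pa_bank, pa, cfg_addr_hash, BANK_TERMS))
```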
3.5.4 Logical Rank and Physical Rank Mapping
The memory controller maps the 4-bit logical rank to the 16 external physical rank selection bits. The physical rank selection bits are connected to the DRAM sockets (or components)
on the board. Depending on what type of DRAM modules are populated in the DRAM sockets (for
example, single-rank, dual-rank, or quad-rank), the mapping function can be configured accordingly through the MSH_DDR3_DIMM_CFG register.
Figure 3-4: Rank Mapping
[Diagram of the logical rank bits 3 through 0 mapped to the physical rank selection bits 15 through 0.]
3.6 Scheduler
3.6.1 Memory Page Management Policy
The memory controller supports an open-page policy and a close-page policy. The page management policy can be configured by using the MSH_CONTROL register.
When the close-page policy is used, a DRAM page will be closed after its memory reference is
done. In general, the close-page policy provides more deterministic memory latency, while it also
consumes more power on external memories. When the memory access patterns are random (that
is minimal spatial and temporal locality), and the memory request rate is light, the close-page policy could provide lower memory latency in some applications.
When the open-page policy is used, a DRAM page stays open once it is opened (with the hope
that the same page will be accessed in the near future). When the open-page policy is enabled, the
memory controller must decide whether to apply the auto-precharge command each time
it dispatches a memory read command or a memory write command. The decision
is based on scheduled memory requests that are to be dispatched in the near future. As such, the
open-page policy could result in page management decisions similar to those of the close-page policy
when the memory access patterns are random and the memory request rate is heavy.
3.6.2 Memory Request Reordering
The memory controller uses a 32-entry CAM to assist the scheduling decisions. The decisions are
based on several considerations.
•
Ordering must be enforced if memory requests reference to the same address.
•
The Read-first policy is used. The memory controller has separate read and write queues. Reads
have higher priority than writes, that is read > write.
•
The Hit-first policy is used. The memory controller checks if the QDN memory request references a DRAM open page or a DRAM close page. For example, a read request to an open page
(that is a read page hit) has a higher priority over a request to a closed page (that is a read page
miss). This will be noted as read page hit > read page miss. Similarly, write page hit > write
page miss.
•
The Priority-first policy is used. Priority level can be assigned to QDN memory requests. For
example, memory requests from a mission critical task could be assigned a high priority. Memory requests with a high priority tend to be serviced first, as such, the response latency tend to
be lower.
•
TILE source awareness. The memory controller checks for the source of the QDN memory
requests for memory request scheduling.
•
Anti-starvation controls are in place. If a request has not been served for certain amount of
time, this request times out and is given a higher priority. Therefore, timed out read page hit >
timed out read page miss > timed out write page hit > timed out write page miss > read page
hit > read page miss > write page hit > write page miss. The starvation threshold is softwareprogrammable. Note that timed out writes have a higher priority over reads. If the starvation
values are programmed such that writes are more likely to time out then reads, then writes
appear to have a higher priority. Writes tend to be serviced later than reads, and many writes
can sit in the CAM. Another threshold can be configured so that writes are treated as starved
if there are too many write requests pending in the CAM.
•
Memory requests are stored in virtual queues and sorted based on memory rank and bank.
Load balancing is applied to reduce the overhead associated with precharge/activate.
•
Memory requests come from one of two QDN networks. Load balancing is applied to improve
fairness among the two networks and reduce potential network congestion.
•
To reduce the DRAM overhead due to turnaround between read and write, it is desirable to
stay in the current read (or write) queue for some amount of time, which is programmable by
software.
•
To reduce the DRAM overhead due to turnaround between ranks, it is desirable to stay in the
same rank for some amount of time.
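The hit-first and anti-starvation ordering listed above can be sketched as a sort key. This is a minimal illustration only: the Request type and its field names are hypothetical, and the sketch ignores the priority level, source awareness, load balancing, and turnaround controls that the real controller also applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical model of one QDN memory request."""
    name: str
    is_write: bool
    page_hit: bool
    timed_out: bool = False


def service_rank(req: Request) -> int:
    """Lower rank is serviced earlier, following the ordering in the text."""
    rank = 0 if req.timed_out else 4   # timed-out requests come first
    rank += 2 if req.is_write else 0   # within a group, reads before writes
    rank += 0 if req.page_hit else 1   # page hit before page miss
    return rank


def service_order(requests):
    """Stable sort, so equal-rank requests keep their arrival order."""
    return sorted(requests, key=service_rank)


pending = [
    Request("w_hit", is_write=True, page_hit=True),
    Request("r_miss", is_write=False, page_hit=False),
    Request("w_miss_starved", is_write=True, page_hit=False, timed_out=True),
    Request("r_hit", is_write=False, page_hit=True),
]
order = [r.name for r in service_order(pending)]
```

In this example the starved write is ordered ahead of every read, which is the effect the note about timed out writes describes.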
3.6.3 Memory Command Reordering
Each QDN memory read/write request might result in multiple external memory commands
(that is, precharge, activate, and column read/write). To reduce the potential overhead associated with precharge and activate, the memory controller may reorder the precharge/activate commands
relative to their associated read or write request. For example, a sequence of pchg1, act1, read1,
pchg2, act2, read2 can be converted to pchg1, pchg2, act1, act2, read1, read2.
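The example sequence can be reproduced with a stable sort by command phase. This is a sketch under the assumption that the two requests target independent banks; the command-name format and the helper names are illustrative, and the real controller decides based on bank state and DRAM timing, which are not modeled here.

```python
# Phase of each command kind: row commands first, column commands last.
PHASE = {"pchg": 0, "act": 1, "read": 2, "write": 2}


def command_kind(cmd: str) -> str:
    """Strip the trailing request number, e.g. 'pchg12' -> 'pchg'."""
    return cmd.rstrip("0123456789")


def hoist_row_commands(cmds):
    """Group precharges, then activates, then column accesses.

    sorted() is stable, so commands of the same kind keep request order.
    """
    return sorted(cmds, key=lambda c: PHASE[command_kind(c)])


before = ["pchg1", "act1", "read1", "pchg2", "act2", "read2"]
after = hoist_row_commands(before)
```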
3.7 DIMM Support
The memory controller supports many kinds of memory modules, which are described in the following sections.
Tile Processor and I/O Device Guide for the TILE-Gx Family of Processors
Tilera Confidential — Subject to Change Without Notice
Chapter 3 Double Data Rate SDRAM (DDR3) Interface
3.7.1 Serial Presence-Detect EEPROM Support
The SPD EEPROM can be accessed through one of the on-chip I2C master interfaces.
3.7.2 Temperature Sensor
Certain DIMM modules include an on-board I2C temperature sensor with an integrated serial presence-detect (SPD) EEPROM. System designers can program the registers on the SPD EEPROM to customize the temperature-sensing configuration. When a critical temperature threshold is
exceeded, the temperature sensor asserts the event to the memory controller. The transition
from “not-exceeding” to “exceeding” the temperature threshold (an edge-triggered event) asserts an interrupt bit
in the memory controller.
3.7.3 Address/Command Parity
Certain DIMM modules support parity detection on the address/command bus. The memory controller generates an even parity bit for the address and command signals (A/BA/RAS_N/CAS_N/WE_N).
When a parity error is detected, the DIMM module asserts the parity error to the memory controller. The transition from no-error to error (an edge-triggered event) asserts an interrupt bit in the memory
controller. Two parity error interrupts are implemented to identify which DIMM modules
have detected the parity error.
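Even parity means the parity bit is chosen so that the covered signals plus the parity bit contain an even number of ones. The sketch below shows the arithmetic only; packing the A/BA/RAS_N/CAS_N/WE_N signals into a single integer is an assumption made for illustration and does not reflect the actual pin ordering.

```python
def even_parity_bit(signals: int) -> int:
    """Return the bit that makes the total count of 1s even."""
    return bin(signals).count("1") & 1


def parity_ok(signals: int, parity: int) -> bool:
    """Check performed on the receiving side: total number of 1s is even."""
    return (bin(signals).count("1") + parity) % 2 == 0


cmd_addr = 0b1011                # three bits set (hypothetical packing)
p = even_parity_bit(cmd_addr)    # 1, so the total becomes four
corrupted = cmd_addr ^ 0b0100    # single-bit flip on the bus
```

A single flipped bit changes the count of ones by one, so the check fails and the module reports the error.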
3.7.4 RDIMM Control Word Access
RDIMM control words provide configuration of certain device features on the DIMMs. The control words are accessed by simultaneously asserting the first two DDRx_CS_N signals on a DIMM. Refer
to the MSH_DDR3_USER_INIT_1 register for more details on control word programming.
3.7.5 Memory PHY Training
The memory controller has various initialization state machines to assist the external DRAM initialization sequence and to bring up the memory PHY interface. Refer to the memory PHY
registers for more details.
CHAPTER 4 PCIe CONTROLLER ARCHITECTURE (TRIO)
4.1 Overview
The TILE-Gx™ processor’s PCIe Controller provides services to integrate the Tile processors with
a PCI system. Both endpoint and root complex modes are supported. Several data movement and
communication models are supported simultaneously. These can be summarized as:
•
Tile Programmed I/O (PIO): Tile software communicates directly with the PCI system using
Memory Mapped I/O (MMIO) loads and stores. PIO can be used for configuration in root complex mode.
•
Host PIO: Host software or a connected PCI device communicates with Tile physical memory
space using reads/writes. The host or PCI device can use DMA transfers to move data to/from
Tile physical memory space.
•
Push DMA: Bulk data transfer from Tile physical memory to PCI address space; typically
used in endpoint mode.
•
Pull DMA: Bulk data transfer from PCI address space to Tile physical memory; typically
used in endpoint mode.
•
Ingress Scatter: Writes from the PCI system consume buffers enqueued at the PCIe controller.
The various controller interfaces are shown in Figure 4-1.
[Block diagram. Blocks shown: MAC RX (P/NP), MAC RX (CPL), MAC TX, Region Match, Tile Request Manager, Request Tracker, Boot & Rshim, Push DMA (PCI Writes), Pull DMA (PCI Reads), scatter queues (SCTR Q), DMA Pickers, descriptor rings 0|1|2|...|N, descFetch, PIO Buffer, Tile PIO Read/Write, Tile Writes, and Tile Messages.]
Figure 4-1: PCIe Controller
4.1.1 Communication and Data Transfer
Bulk data transfer between the I/O device and Tile software is through coherent shared memory
reads and writes. Data is moved directly to and from caches to minimize the off-chip memory
bandwidth requirement. Configuration and PIO traffic utilize MMIO. Interrupts are delivered to
Tile software through the IPI mechanism.
4.1.2 PHY Sharing
On devices that support more than one PCIe port, each port can have its own PCIe interface hardware and DMA engines or the DMA functions can be shared between the ports. The PHY can also
be shared between multiple interfaces in order to optimize utilization of SERDES lanes, but the
software interface provides completely independent ports and any sharing of DMA resources is
non-blocking between ports. See Figure 4-2.
[Block diagram. Blocks shown: three groups of SerDes lanes, PCIe/StreamIO MAC0, PCIe/StreamIO MAC1, PCIe/StreamIO MAC2, TRIO (Transaction I/O), MMIO Interface, and Tiles.]
Figure 4-2: PHY/DMA Sharing Example
Implementation Note: The TILE-Gx36 device has 12 Gen2 SERDES lanes and three PCIe ports. This
allows configuration of one x8 link plus one x4 link, or of three x4 (or narrower) links.
4.2 MMIO Interface
Tile software communicates with the PCIe controller via loads and stores in MMIO space. The
PCIe interface interprets the physical address in the MMIO loads and stores as shown in Figure 4-3 and Table 4-1.
[Bit-field diagram of the MMIO physical address showing the Region and Offset fields; the bit ranges are listed in Table 4-1.]
Figure 4-3: MMIO Address Mapping Format
Table 4-1. MMIO Address Mapping Description
Bits    Name      Description
37:36   Reserved  Reserved
35:32   Region    Selects access to one of the following:
                    0     Config Space (port level is in offset 17:16)
                    1     Push DMA Posts
                    2     Pull DMA Posts
                    3     Scatter Queue FIFOs
                    8-15  PIO (Eight regions)
                    16    MAP MEM (Interrupt) Registers
31:0    Offset    Up to 4GB of offset within the region being accessed.
System software can use large pages to cover multiple regions with the same PTE, or smaller
pages to provide limited access to PCIe structures and PIO space.
The PIO, push-DMA, and pull-DMA regions are described in the sections that follow.
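As an illustration of the address split in Table 4-1, the sketch below extracts the Region and Offset fields from an MMIO physical address. The function name and the "PIO region n" labels are hypothetical. Table 4-1 also lists region value 16 for the MAP MEM registers, which a 4-bit field at bits 35:32 cannot hold; the sketch leaves that case unmodeled rather than guess at the encoding.

```python
REGION_NAMES = {
    0: "Config Space",
    1: "Push DMA Posts",
    2: "Pull DMA Posts",
    3: "Scatter Queue FIFOs",
}
# Values 8-15 select the eight PIO regions (the labels are illustrative).
REGION_NAMES.update({r: f"PIO region {r - 8}" for r in range(8, 16)})


def decode_mmio(pa: int):
    """Split an MMIO physical address into (region, offset, description)."""
    region = (pa >> 32) & 0xF        # bits 35:32
    offset = pa & 0xFFFF_FFFF        # bits 31:0, up to 4GB per region
    return region, offset, REGION_NAMES.get(region, "not listed in Table 4-1")


decoded = decode_mmio((9 << 32) | 0x1234)
```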
4.3 PIO Communication
The PCIe controller’s PIO interface provides direct PCI memory-space communication with the
PCI port(s). This interface is typically used for lower-bandwidth communication such as root-to-device configuration, DMA setup, and interrupts. Loads and stores of 1, 2, 4, and 8 bytes are translated into PCI config or memory space reads and writes and then sent to the PCI system.
The PIO interface supplies eight independent PCI translation regions to map the Tile’s physical
address into a PCI address. These are configured with the TRIO_TILE_PIO_REGION_SETUP
registers.
Each PIO region also has configuration register settings for the MAC it is associated with and the
PCIe access type. The access type can be configured as memory, config, or I/O.
When operating as a memory PIO region, the translation region’s base address is appended as the
MSBs to the low 32 bits of the Tile’s MMIO load/store address to form the PCI address. A request
tracker is used to match incoming completions with outstanding MMIO loads.
When operating as a config PIO region, the offset bits are interpreted as bus, device, and function
number.
I/O transactions use the low 32 bits of the MMIO address as the I/O address.
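The memory and I/O translations described above amount to simple bit operations. In this sketch the function names are hypothetical, and the region base is assumed to supply the bits above bit 31 of a 64-bit PCI address; the config-space case is omitted because the bus/device/function bit layout is not given here.

```python
LOW32 = 0xFFFF_FFFF


def pio_mem_pci_address(region_base_msbs: int, mmio_pa: int) -> int:
    """Memory PIO region: region base as MSBs, low 32 MMIO bits as LSBs."""
    return (region_base_msbs << 32) | (mmio_pa & LOW32)


def pio_io_address(mmio_pa: int) -> int:
    """I/O PIO region: the low 32 bits of the MMIO address."""
    return mmio_pa & LOW32


mmio_pa = (8 << 32) | 0xDEAD_BEE0   # region field 8, offset 0xDEADBEE0
pci_addr = pio_mem_pci_address(0x12, mmio_pa)
```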
4.3.1 Memoryless Operation
In order to support operation without using any local memory controllers, the PCIe interface
allows PIO regions to behave as memory controllers. Thus the main memory backing is provided
by the PCIe subsystem rather than the local device’s memory controllers.
System boot software must configure the Tiles’ MEM_MAP registers to point to the PCIe controller
instead of a memory interface. Memory reads and writes of 1, 2, 4, 8, or 64 bytes will be converted
using the PIO_REGIONs into PCI reads and writes.
The PIO_REGION is selected by the same physical address bits used for MMIO-based access and
is defined in the TRIO_MMIO_ADDRESS_SPACE definition. The remaining address bits are converted to a PCI address as described above.
The strict order mode for the PIO region must be used for memoryless operations in order to preserve the Tile’s memory transaction order.
Memoryless mode is intended for very tightly controlled environments and as such introduces
several restrictions.
TRIO must not be allowed to perform any IO_READs to the Tile memory system if those reads
might miss in the L3 cache, because the home tile will forward the request to memory (TRIO),
which does not handle this type of access.
This restriction means that the system cannot use PUSH DMA to move data up to the host unless
it is guaranteed to hit L2. The TILE-Gx cannot support ingress reads in memoryless mode, other
than to the Rshim, unless they too are guaranteed to hit.
Memoryless systems requiring communication with Tile software must use write-write messaging for communication and generally avoid host-to-Tile reads as well as Tile-to-host DMAs.
Additionally, the SEND_COPY attribute is not supported in memoryless mode. Systems employing memoryless operations via TRIO must set each Tile’s XDN_ATTR_SEND_COPY_DISABLE bit
in the CBOX_MSR SPR.
4.3.2 Ordering
MMIO loads and stores are converted into PCI reads and writes respectively. The ordering of the
PCI transactions follows the PCI order model. This model allows writes to pass reads. Hence an
MMIO store can complete prior to an MMIO load even if the store was issued by the Tile after the
load and even if they are to the same address. System software utilizing PCI PIO must be written
to operate correctly within the PCI order model.
An optional configuration setting in the TRIO_TILE_PIO_REGION_SETUP register allows the
region to be configured as strict order. In this mode, no reads or writes will be sent to the PCI
interface if a previous read (or config write) is outstanding. This mode ensures a strict order of
transactions on the PCI bus, but sacrifices both read and write bandwidth.
Additionally, each PIO_REGION has a separate 32-entry FIFO to reduce head of line blocking
between transactions targeting different devices/MACs. If Tile software requires ordering
between two different PIO_REGIONs, it must issue a memory fence between the operations.
A PIO write transaction will not be completed from the Tile’s perspective until it has been sent to
the PCIe MAC. Thus a memory fence is sufficient to establish ordering between regions, and a fencing read is not required.
4.4 Push DMA
Push DMA is used to move data from Tile memory space to PCI memory space with low Tile processing overhead. The PCIe controller utilizes descriptor rings and a gather engine to collect data
and move it to PCI.
4.4.1 Descriptors
Push-DMA transactions are described by descriptors. The descriptors are written into rings in the
Tile’s memory space and are either posted to the DMA engine via MMIO or automatically read by
the descriptor fetch engine in hunt mode.
The push DMA interface’s descriptor rings provide independent flows for QoS or application-based differentiation of flows. Each ring is processed in order by the push DMA engine, but
ordering between different rings is not maintained. Any ring can be configured to move data to
any MAC.
Implementation Note: The TILE-Gx36 device provides 32 Push-DMA descriptor rings.
The descriptors support gather functionality by allowing a PCI push transaction to be assembled
from multiple Tile-side buffers. The descriptor format is shown in Figure 4-4.
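[Descriptor layout diagram: a 16-byte descriptor drawn as four 32-bit words at offsets 0x00, 0x04, 0x08, and 0x0C. Fields shown: VA[31:0], Gen, Size, sMod, NTF, C, BSZ, VA[41:32], PCI Address [31:0], and PCI Address [63:32].]
Figure 4-4: Push/Pull-DMA Descriptor Format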
Table 4-2. Push/Pull-DMA Descriptions
Bits    Name   Description
31      Gen    Generation Number. Used to indicate valid descriptor in ring.
13      C      Chaining Designation. Always 0 for pull DMA.
                 0  Unchained pointer.
                 1  Chained pointer. Next buffer descriptor (for example, VA)
                    stored in first 8 bytes of the buffer.
               When chaining is used (C = 1), the BSZ field is used to determine the
               size of the first buffer in the chain. Subsequent buffers are sized
               using the size field of the buffer descriptor.
12:10   BSZ    Buffer Size. Encoded size of the first buffer in the chain when C is equal to 1.
                 0  128 bytes
                 1  256 bytes
                 2  512 bytes
                 3  1024 bytes
                 4  1664 bytes
                 5  4096 bytes
                 6  10368 bytes
                 7  16384 bytes
14      NTF    Signal interrupt for this ring when the transaction is complete.
29:16   Size   Total number of bytes to move for this transaction.
               When sMod is equal to 1, this field is encoded (see sMod below).
               When sMod=0 and Size=0, the transaction is a NOP. A SizeZero (NOP)
               descriptor with NTF=1 can be used to generate an interrupt when all older
               descriptors have completed. No read/write packets will be sent to the MAC
               and no Tile memory will be affected.
15      sMod     0  The Size field specifies the total byte count for the
                    transaction.
                 1  The Size field is encoded as 2^(N+14) bytes for N in {0...6}:
                    0=16KB
                    1=32KB
                    ...
                    6=1MB
               All other encodings of the Size field are reserved when sMod=1.
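The control fields of Table 4-2 can be packed into a 32-bit word as sketched below. The function names are hypothetical, and the sketch covers only the fields whose bit positions the table gives; the VA and PCI address words, including VA[41:32], are left out.

```python
def encode_size(nbytes: int):
    """Return (sMod, Size) for a transfer length, per Table 4-2."""
    if nbytes < (1 << 14):
        return 0, nbytes                  # sMod=0: plain byte count, 0 = NOP
    n = nbytes.bit_length() - 15          # candidate N with 2^(N+14) == nbytes
    if n > 6 or nbytes != 1 << (n + 14):
        raise ValueError("length must be < 16KB or a power of two from 16KB to 1MB")
    return 1, n                           # sMod=1: encoded size


def pack_control(gen: int, nbytes: int, ntf: int = 0, chained: int = 0, bsz: int = 0) -> int:
    """Pack Gen[31], Size[29:16], sMod[15], NTF[14], C[13], BSZ[12:10]."""
    smod, size = encode_size(nbytes)
    return ((gen << 31) | (size << 16) | (smod << 15)
            | (ntf << 14) | (chained << 13) | (bsz << 10))


small = pack_control(gen=1, nbytes=1500, ntf=1)   # byte count 0x5DC
large = pack_control(gen=0, nbytes=32 * 1024)     # encoded, N=1
nop = pack_control(gen=1, nbytes=0, ntf=1)        # interrupt-only NOP
```

A single descriptor in this model can therefore describe either up to 16383 bytes exactly or one of the seven power-of-two sizes; other lengths would need more than one descriptor.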
4.4.2 Request Partitioning
The push-DMA engine gathers data for the transaction and partitions the PCIe write packets
based on PCIe alignment and sizing rules. There are no alignment restrictions on the Tile side or
PCI side addresses.
The MPS setting in the TRIO_PUSH_DMA_RG_INIT_DAT register determines the maximum packet
size transmitted on the link. The DMA packet generator also never crosses an MPS-aligned PCI
address boundary. This can result in an extra packet being sent on the link, but the effect is
negligible for all but the smallest transfers.
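The effect of the MPS boundary rule can be seen in a small model. The function below is a sketch, not the hardware algorithm: it assumes MPS is a power of two in bytes and splits a transfer so that no packet exceeds MPS or crosses an MPS-aligned PCI address.

```python
def partition_writes(pci_addr: int, length: int, mps: int):
    """Return a list of (address, byte_count) write packets."""
    packets = []
    while length > 0:
        room = mps - (pci_addr % mps)   # bytes left before the next boundary
        chunk = min(length, room)
        packets.append((pci_addr, chunk))
        pci_addr += chunk
        length -= chunk
    return packets


aligned = partition_writes(0x200, 256, mps=256)     # one full packet
unaligned = partition_writes(0x1F0, 256, mps=256)   # boundary forces a split
```

The unaligned transfer needs two packets where the aligned one needs a single packet, which is the extra packet mentioned above.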
4.4.3 Notification and Flow Control
As push-DMA transactions are processed, there are different types of notifications that software
can require.
4.4.3.1 Descriptor Rings Slot Available Notification
An MMIO-readable head pointer allows Tile software to determine how much space is available
in a DMA ring. This head pointer does NOT indicate that the associated descriptors have been
completely processed, only that ring locations older than the head can be reused for new
descriptors.
4.4.3.2 Transaction Complete Notification
The NTF bit in each descriptor can be used to generate an interrupt to Tile software upon completion of the DMA transfer. A running count of descriptors processed for a given ring is also
provided so that software can determine which descriptors have been completely processed.
4.4.3.3 PCI System Notification
Push DMA transfers typically involve writing bulk data to the PCI system and subsequently messaging the PCI system that the transaction is complete. This is done by putting an additional push
descriptor, typically to the MSI location on the host, into the descriptor rings after the bulk data
transfer descriptor. Hence no additional special-purpose hardware is required.
4.4.4 Flush/Fence
When an application crashes or a ring needs to be reconfigured, the flush/fence mechanism
allows hardware resources to be reclaimed without impacting unrelated rings and flows.
The flush flow below must be used when a ring with the FLUSH_MODE bit of the TRIO_PUSH_DMA_RG_INIT_DAT_ASID register asserted takes a TLB fault. Attempting to restart such a ring
without first flushing might cause packet corruption or hardware lockup.
The procedure for recovering a push DMA ring’s hardware resources is:
1. Set the ring’s freeze, stall, and flush bits in the TRIO_PUSH_DMA_DM_INIT_DAT_SETUP register to prevent additional descriptors from being fetched and processed. This also flushes
already-fetched descriptors and buffer data.
2. Issue an MF to ensure that the register setting has completed.
3. Poll the FLUSH_PND bit of the TRIO_PUSH_DMA_CTL register until it is clear to ensure that the
descriptor flush has completed.
4. Set the FENCE bit of the TRIO_PUSH_DMA_CTL register to initiate a coherence fence on outstanding push DMA data reads.
5. Poll the FENCE bit of the TRIO_PUSH_DMA_CTL register to ensure that all outstanding
requests have completed.
6. Issue an MF.
7. Poll the FENCE bit of the TRIO_PUSH_DMA_CTL register until it is clear.
8. Poll the FLUSH_PND bit of the TRIO_PUSH_DMA_CTL register to ensure that the buffer flush has
completed.
9. Software can now use the COUNT field of the TRIO_PUSH_DMA_REGION_VAL register to determine which descriptors have been processed by hardware and which have not.
10. Write the TRIO_PUSH_DMA_DM_INIT_DAT_HEAD register to zero for the associated ring.
11. Write TRIO_PUSH_DMA_DM_INIT_DAT_DESC_STATE0 to 0x1 and TRIO_PUSH_DMA_DM_INIT_DAT_DESC_STATE1 to 0x0. This places the descriptor ring fetch into its initial state.
12. Read the Interrupt Status register (TRIO_INT_VEC*_RTC) from the bound Tile to flush any outstanding interrupts.
13. Issue an MF.
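The recovery steps above can be sketched as a sequence against a register-access object. Everything about the access layer is hypothetical: FakeDev, its set/get/mf methods, the field names FREEZE, STALL, FLUSH, and VALUE, and the way pending bits clear after two polls exist only so the sequence can run. Ring selection and real register layouts are not modeled.

```python
class FakeDev:
    """Hypothetical register model: pending bits read as 1 for two polls."""

    def __init__(self):
        self.ops = []     # writes and fences, in program order
        self.busy = {}    # (register, field) -> polls that still return 1

    def set(self, reg, field, value):
        self.ops.append(("set", reg, field, value))
        if value and field == "FLUSH":
            self.busy[("TRIO_PUSH_DMA_CTL", "FLUSH_PND")] = 2
        if value and field == "FENCE":
            self.busy[(reg, "FENCE")] = 2

    def get(self, reg, field):
        left = self.busy.get((reg, field), 0)
        self.busy[(reg, field)] = max(left - 1, 0)
        return 1 if left else 0

    def mf(self):
        self.ops.append(("mf",))


def poll_until_clear(dev, reg, field, max_polls=1000):
    for _ in range(max_polls):
        if dev.get(reg, field) == 0:
            return
    raise TimeoutError(f"{reg}.{field} did not clear")


def recover_push_ring(dev, int_status_reg):
    setup, ctl = "TRIO_PUSH_DMA_DM_INIT_DAT_SETUP", "TRIO_PUSH_DMA_CTL"
    for field in ("FREEZE", "STALL", "FLUSH"):                     # step 1
        dev.set(setup, field, 1)
    dev.mf()                                                       # step 2
    poll_until_clear(dev, ctl, "FLUSH_PND")                        # step 3
    dev.set(ctl, "FENCE", 1)                                       # step 4
    poll_until_clear(dev, ctl, "FENCE")                            # step 5
    dev.mf()                                                       # step 6
    poll_until_clear(dev, ctl, "FENCE")                            # step 7
    poll_until_clear(dev, ctl, "FLUSH_PND")                        # step 8
    count = dev.get("TRIO_PUSH_DMA_REGION_VAL", "COUNT")           # step 9
    dev.set("TRIO_PUSH_DMA_DM_INIT_DAT_HEAD", "VALUE", 0)          # step 10
    dev.set("TRIO_PUSH_DMA_DM_INIT_DAT_DESC_STATE0", "VALUE", 1)   # step 11
    dev.set("TRIO_PUSH_DMA_DM_INIT_DAT_DESC_STATE1", "VALUE", 0)
    dev.get(int_status_reg, "VALUE")                               # step 12
    dev.mf()                                                       # step 13
    return count
```

The bounded poll loop is a design choice of the sketch: it turns a hung flush into a visible timeout instead of an endless spin.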
4.5 Pull-DMA
The Pull-DMA engine is used to move data from the PCI system to the Tile memory system. Similar to push-DMA, the pull-DMA utilizes descriptor rings to manage transactions (Figure 4-5).
[Block diagram. Blocks shown: MAC RX (CPL), MAC TX, PCI Reads, Write PCI CPL Data to Tile Memory, Request Tracker, Request Partition, Picker, descriptor rings 0|1|2|...|N, and descFetch.]
Figure 4-5: Pull DMA
The descriptor format is identical to that of push DMA and is shown in Figure 4-4; however, only a single
Tile-side buffer descriptor is supported for each transaction. Hence software must post multiple
pull-DMA descriptors to perform a scatter operation.
Implementation Note: The TILE-Gx36 device provides 32 Push and 32 Pull-DMA descriptor rings.
For each PCI read request that is generated, the Tile-side physical address is calculated using the
technique described in 4.7 Address Translation. If a translation fault occurs, the associated DMA
ring is frozen until software installs a proper translation. PCI read transactions prior to the translation fault will be completely processed.
4.5.1 Pull DMA Notifications and Flow Control
As pull DMA transactions are processed, there are several types of notifications required.
4.5.2 Descriptor Rings Slot Available Notification
The descriptor rings are maintained identically to the push DMA descriptor rings.
4.5.3 Transaction Complete Notification
Once the pull-DMA transfer has completed and the data is visible to Tile software, an interrupt
can be optionally delivered via the IPI mechanism by setting the NTF bit in the descriptor. Software can also poll for transaction completion information by reading the associated
TRIO_PULL_DMA_REGION_VAL.COUNT register.
4.5.4 Request Tracker
The Pull-DMA engine partitions the transaction specified in the descriptor into legal PCIe
requests based on PCIe alignment rules and max-request-size limitations. Each PCIe read request
packet is tracked using a hardware request tracker. As completions are returned on the PCIe port,
the request tracker entry provides the Tile side memory address to which the completion data is
returned.
The request tracker is used to detect a number of exception cases including unexpected completions and request timeouts. The request tracker state can be cleared by software or when the
DL_Down state is entered on the PCIe link.
4.6 Flush/Fence
When an application crashes or a ring needs to be reconfigured, the flush mechanism allows hardware resources to be reclaimed without impacting unrelated rings and flows. The procedure for
recovering a pull DMA ring’s hardware resources is:
1. Set the ring’s freeze, flush, and stall bits in TRIO_PULL_DMA_DM_INIT_DAT_SETUP to prevent
additional descriptors from being fetched and processed. Note that if the DMA engine detects
an error (see footnote 1), the ring is automatically frozen and stalled, so only the flush bit needs to be
asserted (without clearing the freeze and stall bits).
2. Issue an MF to ensure that the register setting has completed.
3. Poll the FLUSH_PND bit of the TRIO_PULL_DMA_CTL register to ensure that all outstanding
requests have completed.
4.7 Address Translation
Push and pull DMA descriptors provide data pointers in virtual address space. These are translated to physical addresses utilizing the entries in the shared TLB. Each ring is associated with a
4-bit ASID. The ASID provides the context in which to evaluate the VA.
Footnote 1: For example, a TLB fault on a ring whose FLUSH_MODE bit in the TRIO_PULL_DMA_RG_INIT_DAT register is
asserted.
If the translation for a descriptor fails, the associated ring is frozen and an interrupt is sent via the
IPI mechanism. Software must install a translation and re-enable the ring to continue processing.
PCIe addresses are similarly translated using the same shared I/O TLB. Each ingress mapping
region has an associated ASID and BASE_VA. This is used to form the VA provided to the I/O
TLB.
The shared I/O TLB is managed by system software and has the following properties:
•
There is one central I/O TLB shared between Push-DMA reads, Pull-DMA writes, and MAP
reads/writes.
•
There are 16 TLB entries per ASID and 16 ASIDs.
•
There is a dedicated interrupt binding for Push-DMA, one for Pull-DMA, and one for MAP-MEM.
•
Push-DMA fault data is captured in TRIO_TLB_PULL_DMA_EXC register.
•
Push-DMA faults cause the associated descriptor to be retried until the fault is handled. Thus
subsequent interrupts for the Push-DMA binding can occur and there will be a slight drop in
Push-DMA performance while the miss is being handled.
•
Alternatively, each push-DMA ring can be put into a drop-on-fault mode via the FLUSH_MODE
bit in the associated TRIO_PUSH_DMA_RG_INIT_DAT register. When a push DMA descriptor is
discarded, the PUSH_DESC_DISC interrupt will be triggered.
•
Pull-DMA fault data is captured in TRIO_TLB_PULL_DMA_EXC.
•
Pull-DMA faults cause pull DMA writes to stall; thus only one Pull-DMA fault will occur
at a time. If the Pull-DMA fault is not handled in a timely fashion, the Pull-DMA engine will
stop issuing new PCIe read transactions for the associated MAC and Pull-DMA performance
will drop.
•
Alternatively, each pull-DMA ring can be put into a drop-on-fault mode via the associated
FLUSH_MODE bit of the TRIO_PULL_DMA_RG_INIT_DAT register. When data is dropped in this
mode, a PULL_DATA_DISC interrupt will be triggered.
•
MAP reads and writes each have their own TLB fault interrupt and associated TRIO_TLB_MAP_WR_EXC/TRIO_TLB_MAP_RD_EXC register.
•
MAP faults will cause the subsequent reads/writes to stall. This could cause the PCIe requesting agent to hit a read timeout or violate timeliness rules for posted transactions. This can also
cause deadlock in systems where PIO is being used. Hence most systems will not use the fault-in flow on MAP accesses but rather use fixed TLB mappings to provide windows into PA space.
The I/O MMU described in the following section is more appropriate for dynamic page mappings.
4.7.1 I/O MMU
Incoming PCIe requests that target a MAP MEM region use the I/O MMU table if the region’s
USE_MMU bit is set. The MMU provides many more translations than the I/O TLB and is generally
used to map 32-bit I/O addresses into PA space above 4GB or to provide custom caching attributes (homing) for specific transactions and devices.
Implementation Note: The TILE-Gx36 provides 4096 MMU table entries.
The I/O MMU does not provide a fault-in capability; hence software that issues operations targeting the MMU must
coordinate beforehand to allocate and configure entries.
The index into the MMU table is formed by taking N bits of the VA starting at the MMU page
size. N is based on the size of the table (for example, 12 bits for a 4K table).
This base index is then added to (ASID << M), where M = log2(TableSize) - log2(NumASIDs). This
allows the table to be partitioned into NumASIDs sub-tables. Thus the BASE/LIM pair and ASID
allow each MAP region to be associated with a portion of the MMU table and allow the MMU
table to be shared/protected between different MAP MEM regions.
Each MMU table entry provides a PA, homing attributes, and a valid bit. If a request targets an
MMU entry with the valid bit clear, the request will be redirected to the PANIC_PA and an
MMU_ERROR interrupt will be triggered.
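The index calculation can be written out as follows. This is a sketch: the function name is hypothetical, the 4KB MMU page size in the example is an assumption, and the defaults use the TILE-Gx36 table size of 4096 entries with 16 ASIDs. No range check is applied to the result, since the text relies on the BASE/LIM pair of the MAP region to keep requests within a sub-table.

```python
def mmu_index(va: int, asid: int, page_shift: int,
              table_entries: int = 4096, num_asids: int = 16) -> int:
    """Index = N VA bits starting at the page size, plus (ASID << M)."""
    n = table_entries.bit_length() - 1        # log2(TableSize), e.g. 12
    m = n - (num_asids.bit_length() - 1)      # M = log2(TableSize) - log2(NumASIDs)
    base_index = (va >> page_shift) & ((1 << n) - 1)
    return base_index + (asid << m)


# Page 5 of ASID 2 with 4KB MMU pages: 5 + (2 << 8) = 517.
idx = mmu_index(va=0x5000, asid=2, page_shift=12)
```

With these defaults M is 8, so each ASID starts at a multiple of 256 entries.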
4.8 Ingress Mapping Regions
PCIe requests arriving on the link are compared to the BAR or BAL registers in PCI configuration
space to determine if the request properly targets the device. Packets whose addresses fall within
the device’s range are then compared to “mapping regions” to determine what action is taken
with the packet and its data.
A mapping region consists of a 4KB-aligned base and limit within PCI address space and region-specific attributes described below.
Requests that fail to match any regions will be shunted to the PANIC_PA stored in the TRIO_PANIC_MODE_CTL register. This also will trigger the MAP_UNCLAIMED interrupt. Requests that target
the PANIC_PA prior to system software configuring the tile (e.g. prior to boot) will yield unpredictable system behavior. System designers must coordinate memory-mapped communication to
the TILE-Gx such that PANIC_PA requests are not generated prior to TILE-Gx boot software preparing memory space and exiting cache-as-RAM mode.
[Diagram. Labels shown: PCI Address Space, BAR (Endpoint), !BAL (Root Complex), MemRegion (two instances), Boot, Rshim, Tile Memory Region (VA, ASID), PCI Read, Tile Memory Read(s), Tile Response(s), and PCI Completion(s).]
Figure 4-6: PCI Region Example
4.8.1 Tile Map Memory Regions
Tile memory regions provide a mapping between PCI addresses and Tile memory addresses.
Incoming read and write PCI requests that match a memory region are converted to Tile memory
reads and writes.
Implementation Note: The TILE-Gx36 provides 16 Tile Memory Regions.
Each Tile memory region contains a 4KB-aligned Tile-side base VA and ASID in addition to the
PCI address base and limit. The VA for a request is calculated as (IO_ADDRESS - BASE) + BASE_VA.
This VA is processed through the I/O TLB or I/O MMU, based on the region’s USE_MMU bit, to
produce a Tile PA and associated attributes, such as hash-for-home and caching attributes.
To prevent system-level deadlock, the I/O TLB fault mechanism is not typically used to fault in
translations dynamically. Instead, the I/O MMU table is used when the system requires translations to be updated dynamically.
A single PCI request might need to be partitioned into multiple Tile memory transactions. The
controller tracks outstanding Tile memory reads in order to form proper PCIe
completions.
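Region matching and the VA calculation can be sketched together. The MapRegion type and the translate function are hypothetical, and treating the limit as an inclusive address is an assumption. A request that matches no region returns None here, standing in for the redirect to PANIC_PA.

```python
from dataclasses import dataclass


@dataclass
class MapRegion:
    base: int      # PCI address base (4KB aligned)
    limit: int     # PCI address limit (treated as inclusive, an assumption)
    base_va: int   # Tile-side base VA (4KB aligned)
    asid: int


def translate(regions, io_address: int):
    """Return (ASID, VA) for the first matching region, else None."""
    for r in regions:
        if r.base <= io_address <= r.limit:
            return r.asid, (io_address - r.base) + r.base_va
    return None    # unclaimed request: hardware would use PANIC_PA


regions = [MapRegion(base=0x8000_0000, limit=0x8000_FFFF,
                     base_va=0x1_0000_0000, asid=3)]
hit = translate(regions, 0x8000_1234)
miss = translate(regions, 0x9000_0000)
```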
4.8.1.1 MAP-MEM Interrupts
Each map-memory region has a set of 16 general-purpose interrupt bits. These bits are accessible
both from the Tile side and from the PCI Express interface. The bits can be configured to trigger
Tile-side interrupts. Each map-memory region allows its interrupt vector to be configured to dispatch the associated Tile-side interrupt based on level or edge semantics.
The interrupt vector itself can be accessed from the PCI Express or Tile MMIO interfaces via one
of four different registers. Each register has unique access semantics as described below:
Table 4-3. Register Behaviors
Register Number  Read Behavior                            Write Behavior
0 (R/W)          Returns current value.                   Writes a new value.
1 (RC/W1TC)      Returns current value, clears all bits.  For each bit: 1 clears the bit; 0 leaves it intact.
2 (R/W1TS)       Returns current value.                   For each bit: 1 sets the bit; 0 leaves it intact.
3 (R/SetBit)     Returns current value.                   Sets the bit indexed by the data value (that is,
                                                          the data value indicates which bit is to be set).
4-7              Same behavior as registers 0-3,          Same behavior as registers 0-3,
                 but without any “edge” interrupts.       but without any “edge” interrupts.
From the Tile side, these registers are accessible via the MAP region within MMIO space. From
PCI Express, these registers appear as the first 64 bytes of the associated map-memory region.
Each register occupies 8 bytes of address space. Registers 4-7 behave the same as the associated
register in locations 0-3, but they do not generate any edge interrupts.
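The access semantics of Table 4-3 can be modeled with a small class. The class and method names are hypothetical, interrupt delivery is not modeled (so registers 4-7 behave exactly like 0-3 here), and ignoring a SetBit index of 16 or more is an assumption.

```python
class MapMemIntVector:
    """Model of one region's 16 interrupt bits and their access registers."""
    MASK = 0xFFFF

    def __init__(self):
        self.bits = 0

    @staticmethod
    def register_for_offset(offset: int) -> int:
        """Each register occupies 8 bytes in the first 64 bytes of the region."""
        if not 0 <= offset < 64:
            raise ValueError("offset is outside the interrupt registers")
        return offset // 8

    def read(self, reg: int) -> int:
        value = self.bits
        if reg % 4 == 1:                      # RC/W1TC: read clears all bits
            self.bits = 0
        return value

    def write(self, reg: int, data: int) -> None:
        kind = reg % 4                        # registers 4-7 mirror 0-3
        if kind == 0:                         # R/W: write a new value
            self.bits = data & self.MASK
        elif kind == 1:                       # W1TC: clear bits written as 1
            self.bits &= ~data & self.MASK
        elif kind == 2:                       # W1TS: set bits written as 1
            self.bits |= data & self.MASK
        else:                                 # SetBit: data selects the bit
            self.bits |= (1 << data) & self.MASK


vec = MapMemIntVector()
vec.write(0, 0b1010)      # bits = 0b1010
vec.write(2, 0b0001)      # bits = 0b1011
vec.write(1, 0b0010)      # bits = 0b1001
vec.write(3, 5)           # bits = 0b101001
snapshot = vec.read(1)    # returns 0b101001, then clears everything
```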
When the INT_ENA bit is set for the associated MAP MEM region, the region’s address space is
formatted as shown in Figure 4-7.
Address Range                              Contents
MAP_MEM_BASE to MAP_MEM_BASE+7             MAP_MEM_INT0 Register
MAP_MEM_BASE+8 to MAP_MEM_BASE+15          MAP_MEM_INT1 Register
MAP_MEM_BASE+16 to MAP_MEM_BASE+23         MAP_MEM_INT2 Register
MAP_MEM_BASE+24 to MAP_MEM_BASE+31         MAP_MEM_INT3 Register
MAP_MEM_BASE+32 to MAP_MEM_BASE+63         Same as MAP_MEM_INT0-3, except no edge-interrupt generation.
MAP_MEM_BASE+64 to MAP_MEM_lim             Normal MAP-MEM Address Space Mapped to Tile Memory Space
Figure 4-7: MAP Region within MMIO Space
4.8.1.2 Map-Region Ordering
Each map-memory region can be configured into one of three modes via the ORDER_MODE field in
the associated TRIO_CFG TRIO_MAP_MEM_SETUP register:
•
UNORDERED: Writes to different cachelines are not ordered with respect to each other. Reads
will never complete until all older writes in all mapping regions have become visible to Tile
software.
•
STRICT: Writes and reads are strictly ordered, even to different cachelines and across different
mapping regions (including Scatter Queue (SQ) regions). This mode might result in decreased
write performance.
•
REL_ORD: Write ordering is enforced if the incoming packet’s relaxed-ordering attribute is
clear. If the packet’s relaxed-ordering bit is set, the writes are unordered. Reads will never complete until all older writes in all mapping regions have become visible to Tile software.
The interrupt registers are updated using the STRICT order model defined above. That means all
previous write data, even to UNORDERED regions, will be visible to Tile software prior to the interrupt state registers being updated. The Tile-side interrupt will also follow the STRICT order
model and be triggered at the point the interrupt register write is made visible.
4.8.2 Scatter Queue Regions
Scatter queue (SQ) regions provide a means for mailbox/doorbell communication via PCI
Express. Each scatter region consists of a descriptor FIFO used to provide target VAs for incoming
PCIe requests. Both reads and writes are supported to SQ regions.
Each time a PCIe write or read is received within a scatter region, its Tile-side VA is calculated by
adding the offset within the region to the VA provided by the descriptor at the head of the
descriptor FIFO.
Implementation Note: The TILE-Gx36 provides eight SQ Regions.
A single write-only 8-byte register in the last 8-bytes of the region provides the ability to dequeue
the descriptor and/or generate an interrupt (doorbell). The register format is described in TRIO_MAP_SQ_DOORBELL_FMT in the register specification and shown in Table 4-4.
Table 4-4. TRIO_MAP_SQ_DOORBELL_FMT Register Bit Descriptions
Bits  Name      Type  Reset  Description
1     POP       WO    0      When written with a 1, the descriptor at the head of the associated MAP_SQ's FIFO will be dequeued.
0     DOORBELL  WO    0      When written with a 1, the associated MAP_SQ region's doorbell interrupt will be triggered once all previous writes are visible to Tile software.
Writes to the TRIO_MAP_SQ_DOORBELL_FMT register must be 8 bytes or less (for example, not combined with a write to the data portion of the base and limit region). Writes smaller than 8 bytes to the upper 7 bytes of the TRIO_MAP_SQ_DOORBELL_FMT register have undefined behavior. Larger writes that happen to overlap the TRIO_MAP_SQ_DOORBELL_FMT register will be written to Tile memory and will not access the register. Reads of the TRIO_MAP_SQ_DOORBELL_FMT register also access Tile memory and have no impact on the register.
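As an illustration of the doorbell register format, the sketch below builds the 8-byte value a remote device would write and locates the register in the last 8 bytes of the region. The bit positions come from Table 4-4; the helper names and the `region_size` parameter are hypothetical, and the code assumes the region size is a multiple of 8 bytes.

```c
#include <assert.h>
#include <stdint.h>

#define SQ_DOORBELL_BIT (UINT64_C(1) << 0) /* DOORBELL, bit 0 in Table 4-4 */
#define SQ_POP_BIT      (UINT64_C(1) << 1) /* POP, bit 1 in Table 4-4 */

/* Build the value for a single 8-byte write to the doorbell register. */
static uint64_t sq_doorbell_value(int pop, int doorbell)
{
    uint64_t v = 0;
    if (pop)
        v |= SQ_POP_BIT;
    if (doorbell)
        v |= SQ_DOORBELL_BIT;
    return v;
}

/* Byte offset of the doorbell register: the last 8 bytes of the region. */
static uint64_t sq_doorbell_offset(uint64_t region_size)
{
    assert(region_size >= 8 && region_size % 8 == 0);
    return region_size - 8;
}
```

Writing the value as one aligned 8-byte store keeps the access within the rules above, since partial writes to the upper bytes are undefined and larger writes bypass the register.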
The TRIO_INT_VEC3_W1TC vector register contains the interrupts associated with the map SQ
regions. These interrupts are referred to as “doorbell interrupts”. The lower eight bits contain
information about the associated region’s doorbell interrupt. The next eight bits are the associated
region’s descriptor-dequeue interrupt. The TRIO_INT_VEC3_W1TC register is paired with the
TRIO_INT_VEC3_RTC register, which provides the mechanism for clearing the associated doorbell interrupts when it is read.
The scatter queue descriptor format is described in Table 4-5.
Table 4-5. TRIO_MAP_SQ_REGION_WRITE_VAL Bit Descriptions
Bits   Name      Type  Reset  Description
63     INT_ENA   WO    0      Indicates that an interrupt is requested when this descriptor is dequeued.
62:42  Reserved               Reserved
41:12  VA        WO    0      4KB-aligned VA to be used on incoming MAP_SQ writes. The VA for an incoming write will be IO_ADDRESS - MAP_SQ_BASE + VA.
11:0   Reserved               Reserved
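The descriptor format and the address calculation can be sketched as follows. The bit positions and the formula IO_ADDRESS - MAP_SQ_BASE + VA are taken from Table 4-5; the function names are hypothetical and the code is only an illustration of the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

#define SQ_DESC_INT_ENA (UINT64_C(1) << 63)
/* VA occupies bits 41:12, so the value is 4KB aligned and below 2^42. */
#define SQ_DESC_VA_MASK (((UINT64_C(1) << 42) - 1) & ~UINT64_C(0xFFF))

/* Encode a scatter-queue descriptor from a 4KB-aligned VA. */
static uint64_t sq_desc_encode(uint64_t va, int int_ena)
{
    assert((va & ~SQ_DESC_VA_MASK) == 0);
    return va | (int_ena ? SQ_DESC_INT_ENA : 0);
}

/* Tile-side VA for an incoming access, given the descriptor at the FIFO head. */
static uint64_t sq_target_va(uint64_t desc, uint64_t io_address,
                             uint64_t map_sq_base)
{
    assert(io_address >= map_sq_base);
    return (io_address - map_sq_base) + (desc & SQ_DESC_VA_MASK);
}
```

For instance, with a descriptor VA of 0x12345000 and an access 0x40 bytes into the region, the target VA would be 0x12345040.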
Tile software is responsible for keeping the 64-entry descriptor FIFO full. The interrupt bit (‘I’, INT_ENA in Table 4-5) can be used for flow control so that software knows when more descriptors are needed.
If more than 64 descriptors are written into the descriptor FIFO, it might drop descriptors and
trigger the MAP_SQ_OVFL interrupt.
If an incoming PCIe transaction falls within an SQ_REGION’s base and limit but there are no valid
descriptors in the FIFO, the transaction will be directed to the PANIC_PA bitfield stored in the
TRIO_PANIC_MODE_CTL register. This will also trigger the MAP_SQ_EMPTY interrupt.
Access to the write-only DOORBELL bitfield of the TRIO_MAP_SQ_DOORBELL_FMT register is only allowed if there is a valid descriptor in the associated FIFO. If a DOORBELL write arrives from the PCIe link and there is no valid descriptor, the access will be directed to the address stored in the PANIC_PA bitfield, as described above.
As with MAP-Memory regions, SQ_REGIONs provide UNORDERED, STRICT, and REL_ORD modes
as described above. The doorbell interrupt is always delivered as a strict-order operation. In other
words, the interrupt will not be triggered until all older writes are visible to Tile software.
4.8.3 Boot and Rshim Regions
Rshim access is provided through a dedicated mapping region. The Rshim region consists of 1MB
of PCI address space mapped to the Rshim’s address space. The address is interpreted as {channel,offset}. For more information, refer to the TRIO_MAP_RSH_ADDR_FMT register.
The Rshim region is mapped into PCIe BAR-0 automatically on hardware reset. Additionally, if
the port is enabled for root complex (RC) operation, the 1MB region will be enabled in the low
address range of the negatively decoded base and limit region. Thus the Tile processor can be
booted from either a root complex or endpoint device.
The Rshim consists of 64-bit registers. In order to provide compatibility with 32-bit (dword) PCIe
accesses, the Rshim contains a set of registers for mapping the 64-bit register accesses into indirect
32-bit accesses.
Boot is achieved by writing the Rshim’s packet generator interface through the Rshim map
region.
Software can implement flow control to prevent boot transactions from backing up onto the PCIe
bus. This flow control consists of polling the Rshim’s packet generator data-words-sent counter to
see how much data has been sent.
The interface can sink up to 4KB of boot data without back-pressuring the MAC, so the flow control only needs to be done once for every 4KB of data. One example software algorithm would be to send 4KB of data, then poll until 2KB has been sent, then send another 2KB, and so on.
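The example algorithm can be sketched against a small simulation. Everything below the two size constants is an assumption made for illustration: the `sim_boot` structure stands in for the real Rshim accesses, the drain rate of 512 bytes per poll is arbitrary, and the data-words-sent counter is modeled as a byte count that never wraps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BOOT_SINK_BYTES  4096u /* boot data the interface can sink (from the text) */
#define BOOT_CHUNK_BYTES 2048u /* refill size used by the example algorithm */

/* Simulated Rshim boot path (hypothetical stand-in for real MMIO accesses). */
typedef struct {
    uint64_t received;        /* bytes written by the host */
    uint64_t sent;            /* bytes the packet generator has forwarded */
    uint64_t max_outstanding; /* worst-case backlog seen by the model */
} sim_boot;

static void sim_boot_write(sim_boot *s, const uint8_t *data, size_t len)
{
    (void)data;
    s->received += len;
    if (s->received - s->sent > s->max_outstanding)
        s->max_outstanding = s->received - s->sent;
}

/* Models one poll of the data-words-sent counter, converted to bytes. */
static uint64_t sim_boot_bytes_sent(sim_boot *s)
{
    s->sent += 512; /* pretend 512 bytes drain between polls */
    if (s->sent > s->received)
        s->sent = s->received;
    return s->sent;
}

/* Send an image while keeping at most BOOT_SINK_BYTES outstanding. */
static void boot_send(sim_boot *s, const uint8_t *image, size_t len)
{
    size_t written = 0;
    while (written < len) {
        /* Wait until at least one chunk of the 4KB sink is free. */
        while (written - sim_boot_bytes_sent(s) >
               BOOT_SINK_BYTES - BOOT_CHUNK_BYTES)
            ;
        size_t n = len - written;
        if (n > BOOT_CHUNK_BYTES)
            n = BOOT_CHUNK_BYTES;
        sim_boot_write(s, image + written, n);
        written += n;
    }
}

/* Run the model on a zero-filled image and return the final state. */
static sim_boot boot_demo(size_t len)
{
    static uint8_t image[16384];
    sim_boot s = { 0, 0, 0 };
    assert(len <= sizeof image);
    boot_send(&s, image, len);
    return s;
}
```

In this model the backlog never exceeds 4KB, which is the property the flow control is meant to preserve. A real driver would also need to handle counter width and wraparound, which the sketch ignores.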
A separate FIFO provides a path for non-boot transactions to the Rshim. Thus, as long as software
does not send more than 4KB of boot data, read and write accesses to Rshim registers will complete without blocking.
Only 1-, 2-, 4-, and 8-byte writes and reads are supported to the Rshim region. A read request for more than 8 bytes will result in a completion with an UnsupportedRequest status. Writes larger than 8 bytes will only complete the first 8 bytes written. The remaining bytes will be dropped.
4.8.4 Map Fence
When a MAP-MEM/SQ region needs to be reconfigured, the MAP-Mem Fence mechanism can be
used to guarantee that all outstanding transactions have completed. To reconfigure a MAP-MEM
or SQ region, the following procedure should be used:
1. Disable the region by clearing its MAC_ENA bits in the TRIO_MAP_MEM_SETUP or TRIO_MAP_SQ_SETUP register.
2. Issue an MF to ensure that the register write has completed.
3. Set the FENCE bit in TRIO_MAP_MEM_CTL.
4. Poll the FENCE bit until it clears.
5. Poll the RDQ bitfield of the TRIO_MAP_DIAG_FSM_STATE register until it reads zero to ensure all reads have issued to Tile memory space.
   Software can now be sure that no new transactions will arrive from the associated region, no TLB misses will occur from the region, and all older writes will have completed.
6. If software needs to ensure that all reads have completed (for example, all data fetched from Tile memory), the FENCE bit of the TRIO_PUSH_DMA_CTL register must also be written and polled at this time.
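Steps 1 through 5 of the procedure above can be sketched against a simulated register file. This is only an outline of the control flow: the register indices, the field masks, and the `sim_trio` model are placeholders invented for the example and do not reflect the real register layout, and the C11 fence stands in for the Tile MF instruction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Register indices and field masks are placeholders, not the real layout. */
enum { REG_MAP_SETUP, REG_MAP_MEM_CTL, REG_MAP_DIAG_FSM_STATE, REG_COUNT };
#define MAC_ENA_MASK UINT64_C(0x7)
#define FENCE_BIT    UINT64_C(0x1)
#define RDQ_MASK     UINT64_C(0xF)

typedef struct {
    uint64_t regs[REG_COUNT];
    int fence_polls_left; /* simulated time for the fence to complete */
} sim_trio;

static void sim_write(sim_trio *t, int reg, uint64_t val)
{
    assert(reg >= 0 && reg < REG_COUNT);
    t->regs[reg] = val;
}

static uint64_t sim_read(sim_trio *t, int reg)
{
    assert(reg >= 0 && reg < REG_COUNT);
    if (reg == REG_MAP_MEM_CTL && (t->regs[reg] & FENCE_BIT) &&
        --t->fence_polls_left <= 0)
        t->regs[reg] &= ~FENCE_BIT; /* model: hardware clears FENCE when done */
    if (reg == REG_MAP_DIAG_FSM_STATE && (t->regs[reg] & RDQ_MASK))
        t->regs[reg] -= 1;          /* model: read queue drains over time */
    return t->regs[reg];
}

static void map_region_quiesce(sim_trio *t)
{
    /* Step 1: disable the region. */
    sim_write(t, REG_MAP_SETUP, sim_read(t, REG_MAP_SETUP) & ~MAC_ENA_MASK);
    /* Step 2: stand-in for the MF instruction. */
    atomic_thread_fence(memory_order_seq_cst);
    /* Step 3: request the fence. */
    sim_write(t, REG_MAP_MEM_CTL, sim_read(t, REG_MAP_MEM_CTL) | FENCE_BIT);
    /* Step 4: wait for the fence to clear. */
    while (sim_read(t, REG_MAP_MEM_CTL) & FENCE_BIT)
        ;
    /* Step 5: wait for the read queue to drain. */
    while (sim_read(t, REG_MAP_DIAG_FSM_STATE) & RDQ_MASK)
        ;
}

/* Returns 1 if the simulated region ends up disabled, fenced, and drained. */
static int map_fence_demo(void)
{
    sim_trio t = { { MAC_ENA_MASK, 0, 2 }, 3 };
    map_region_quiesce(&t);
    return (t.regs[REG_MAP_SETUP] & MAC_ENA_MASK) == 0 &&
           (t.regs[REG_MAP_MEM_CTL] & FENCE_BIT) == 0 &&
           (t.regs[REG_MAP_DIAG_FSM_STATE] & RDQ_MASK) == 0;
}
```

A production driver would likely bound both polling loops with a timeout; the sketch omits that to keep the ordering of the steps visible.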
4.9 Panic Mode
Since Tile-side software is required to fill I/O TLB translations to allow forward progress of MAP-MEM/SQ transactions, it is possible for the PCIe bus to become clogged with transactions if the Tile-side software has crashed or otherwise stopped responding to TLB fill requests. In order to allow system recovery without crashing the PCIe host system, an optional timer configured in the TRIO_PANIC_MODE_CTL register detects when TRIO is preventing forward progress on MAC transactions.
When the panic timer fires, any pending TLB misses are aborted and all MAP-MEM/SQ transactions that miss in the TLB are instead shunted to the pre-configured physical address stored in PANIC_PA. This guarantees forward progress so that system software can access Rshim registers for debug and reset.
4.10 Connection to mPIPE
It is sometimes desirable to treat data from the PCI system as “packet” data and pass it through
the TILE-Gx processor’s mPIPE™. This allows, for example, packets collected by a host-connected
Network Interface Card (NIC) to be off-loaded to TILE-Gx for processing as in Figure 4-8.
[Figure: block diagram. Labels: TILE-Gx PCIe NW Offload Card (Application, DDR3, Pull DMA Driver, PDE), PCIe, Network, NIC, Host Chipset, Host Processor, DRAM.]
Figure 4-8: Host Offload Model
To support distributing data to the mPIPE, the TILE-Gx processor provides dedicated eDMA gather/loopback channels that allow data from Tile memory space to be processed through the classification, load-balance, distribution, and notification services in the mPIPE’s iDMA path.
This allows data to be collected into buffers by the PCIe controller’s Pull DMA function in the offload model shown above. Alternatively, if the TILE-Gx processor is the root complex with an attached NIC, the NIC’s driver running on the TILE-Gx processor will collect packet data into Tile memory, likely using the NIC’s push DMA. From Tile memory, the data can be sent through the mPIPE using the eDMA gather/loopback function (Figure 4-9).
[Figure: block diagram. Labels: TILE-Gx PCIe NW Host/Appliance (DDR3, Application, NIC Driver, PDE), PCIe, Network, PCIe NIC.]
Figure 4-9: Hosted-NIC Model
4.11 Deadlock
PCIe ordering rules are defined in the PCI Express® Base 2.0 specification in Table 2-24. These
rules enforce PCI’s producer-consumer memory model and prevent deadlock.
These rules include the following for deadlock avoidance:
• Non-Posted requests never block posted requests. The PIO buffer, for example, will bypass stalled Non-Posted requests so that posted requests make progress.
• Non-Posted requests never block completions.
Packets arriving from the PCIe port are mapped into the Tile memory system as follows:
Packet Type        Region Type          Dependencies
Memory Read (NP)   Tile Memory / Rshim  TileMem (read/resp), PCI Completion
                   NoMatch              Tile SW, PCI Completion
Memory Write (P)   Tile Memory / Rshim  TileMem (write)
                   NoMatch              Tile SW
Completion (CPL)   (Request Tracker)    TileMem (write), IDN (interrupt)
Because the Tile memory system is deadlock free and always drains, the dependencies are on Tile
SW and PCI completions. PCI completions always drain. So Tile SW is the only true dependency
that must be managed.
In order to support Tile offload models that might introduce dependencies between non-posted
and posted/completion transactions, the controller provides an optional non-posted ingress
credit counter. This counter decrements each time a non-posted packet is sent via the software
region. When zero, non-posted packets will not be dequeued from the MAC and posted/completion traffic will continue to make progress. The counter can be incremented or written by
software.
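The behavior of the non-posted ingress credit counter can be modeled in a few lines. This is a software model for illustration only; the type and function names are hypothetical and the actual counter is a hardware register whose name and width are not given here.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of the optional non-posted ingress credit counter. */
typedef struct {
    uint32_t credits;
} np_credit_counter;

/* Returns 1 if a non-posted packet may be dequeued from the MAC, 0 if it
 * must wait. Posted and completion traffic is not gated by this counter. */
static int np_try_dequeue(np_credit_counter *c)
{
    if (c->credits == 0)
        return 0;
    c->credits--;
    return 1;
}

/* Software returns credits once it has handled earlier non-posted packets. */
static void np_add_credits(np_credit_counter *c, uint32_t n)
{
    assert(c->credits <= UINT32_MAX - n);
    c->credits += n;
}

/* Two credits allow two of three packets through; one more credit then
 * releases the packet that was held. Returns the number forwarded. */
static unsigned np_credit_demo(void)
{
    np_credit_counter c = { 2 };
    unsigned forwarded = 0;
    for (int i = 0; i < 3; i++)
        forwarded += (unsigned)np_try_dequeue(&c);
    np_add_credits(&c, 1);
    forwarded += (unsigned)np_try_dequeue(&c);
    return forwarded;
}
```

The point of the model is that a held non-posted packet stalls only the non-posted flow, so software can bound how many such requests it has to service at once.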
In order to improve the performance of posted/completion flows in the presence of a congested
PCI completion flow, the controller allows posted/completion traffic received on the PCIe link to
make progress even if a full PCI completion buffer is blocking incoming non-posted traffic. This is
not strictly required for deadlock-free operation, but allows more deterministic posted write
performance.
CHAPTER 5
PCIe MAC INTERFACE
5.1 Introduction
TILE-Gx™ PCIe interfaces are connected to TRIO to integrate the off chip PCIe subsystem with
TILE-Gx’s memory system. This is shown in Figure 5-1.
[Figure: block diagram. Three groups of SerDes lanes connect to three PCIe/StreamIO MACs (MAC0, MAC1, MAC2), which connect to TRIO (Transaction I/O) and the Tiles.]
Figure 5-1: PCIe I/O Interface Subsystem
Implementation Note: The TILE-Gx36™ device contains three PCIe ports that connect to a single TRIO
instance.
Each PCIe interface provides the following features:
• PCIe Gen-2 support (5Gbps per lane)
• Each port can be configured as endpoint or host
• Each port can be replaced by a StreamIO instance for lightweight FPGA connections
• TILE-Gx is bootable via endpoint, root-complex, or StreamIO
• Auto-negotiated link width
• Lane/Polarity reversal
• Max Payload Size of up to 1024 bytes for efficient data transfer
• Nonblocking diagnostics access via endpoint, root, or StreamIO
• SR-IOV support
• MSI/MSI-X plus legacy interrupt support
• Support for mPIPE™ buffer descriptors for zero-copy packet-to-Tile-to-PCIe transfers
• Compliant with the PCI Express® Base 2.0 specification
• Programmable device capabilities including BAR sizes
• Support for OEM of PCIe device via writable Vendor/DeviceIDs and other capability structures
• Multiple simultaneous data movement models including push DMA, pull DMA, PIO, memory-mapped, and mailbox/doorbell with no Tile software overhead
• Low power and power-down modes supported in both the PHY and the MAC including active state (dynamic) power management, L1/L2 power down, and beacon/wake.
• Crosslink support to allow identically configured endpoint or root ports to be interconnected
• Advanced error reporting, function-level-reset, and vendor messaging capabilities
5.2 Register Spaces
PCIe registers are accessible via MMIO space. See the TRIO_CFG_REGION_ADDR register specification. The physical address provided to TRIO is interpreted as follows to provide access to MAC/
MAC-interface registers:
Figure 5-2: TRIO_CFG_REGION_ADDR Register
Table 5-1. TRIO_CFG_REGION_ADDR Register Descriptions
Bits   Name     Type  Reset  Description
36:32  REGION   RW    0      Selects CFG_SPACE
21:20  PROT     RW    0      Unused for MAC address space
19:18  MAC_SEL  RW    0      Selects the MAC being accessed.
17:16  INTFC    RW    0      Interface being accessed. Values are listed below.
15:0   REG      RW    0      Configuration register to be accessed. Note that TRIO and MAC_INTERFACE registers are always aligned on 8-byte boundaries and access is always 8 bytes at a time. MAC registers are 4-byte oriented.

INTFC values:
Value  Name            Meaning
0      TRIO            Access to centralized TRIO registers. 8-byte oriented registers.
1      MAC_INTERFACE   Access to per-MAC interface control registers (interrupts, serdes control etc.). 8-byte oriented registers.
2      MAC_STANDARD    Access to per-MAC registers (PCIe config space etc.). This interface is typically only used by BIOS/discovery software since it treats BAR registers as read only and thus prevents BAR resizing. 4-byte oriented registers. Also supports 1 and 2 byte operations. The upper 4 bytes of an 8-byte store will be discarded. The upper 4 bytes of an 8-byte load will be zeroed.
3      MAC_PROTECTED   Access to per-MAC registers (allows writing of BAR registers). 4-byte oriented registers. Also supports 1 and 2 byte operations. The upper 4 bytes of an 8-byte store will be discarded. The upper 4 bytes of an 8-byte load will be zeroed.
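The address format in Table 5-1 can be illustrated by composing an address from its fields. The field positions and INTFC encodings come from the table; the function name is hypothetical, the REGION value that selects CFG_SPACE is left as a parameter because it is not given here, and PROT is left at zero since it is unused for MAC address space.

```c
#include <assert.h>
#include <stdint.h>

/* INTFC encodings from Table 5-1. */
typedef enum {
    INTFC_TRIO          = 0,
    INTFC_MAC_INTERFACE = 1,
    INTFC_MAC_STANDARD  = 2,
    INTFC_MAC_PROTECTED = 3
} trio_intfc;

/* Compose a TRIO_CFG_REGION_ADDR-style address from its fields. */
static uint64_t trio_cfg_addr(unsigned region, unsigned mac_sel,
                              trio_intfc intfc, unsigned reg)
{
    assert(region < 32 && mac_sel < 4 && reg <= 0xFFFF);
    /* TRIO and MAC_INTERFACE registers are aligned on 8-byte boundaries. */
    if (intfc == INTFC_TRIO || intfc == INTFC_MAC_INTERFACE)
        assert(reg % 8 == 0);
    return ((uint64_t)region << 32) |  /* REGION,  bits 36:32 */
           ((uint64_t)mac_sel << 18) | /* MAC_SEL, bits 19:18 */
           ((uint64_t)intfc << 16) |   /* INTFC,   bits 17:16 */
           (uint64_t)reg;              /* REG,     bits 15:0  */
}
```

For example, register offset 0x10 in the MAC_STANDARD space of MAC 2 yields 0xA0010 in the low bits, before the REGION field is added.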
5.2.1 Type-0/1 and Virtual Function Configuration Space
Access to Type-0 (endpoint) and Type-1 (root complex) configuration space is provided within
the MAC_STANDARD and MAC_PROTECTED address spaces. To access the type-0 config space
for virtual functions (SR-IOV), the TRIO_PCIE_INTFC_VF_ACCESS register must be programmed
with the target virtual function number.
5.3 Port Configuration
Each PCIe port may be configured via either strapping pins or MMIO registers. See the TRIO_PCIE_INTFC_PORT_CONFIG register for strapping and software port configurations.
If the port is enabled via strapping pins, the port will train automatically without software running on the Tile. However, if the port is to be trained after software boot, the
TRIO_PCIE_INTFC_PORT_CONFIG register can be used to select the device type (StreamIO vs.
PCIe-Root vs. PCIe-Endpoint) as well as other port-specific settings.
Once a port is enabled, it will automatically train to the widest possible link width and fastest possible link speed. Polarity and lane reversal will also be auto-negotiated.
5.4 IO Address Mapping
The 64-bit PCIe address space is separate from the Tile physical address space. Address translation is used to provide protections of both the PCIe and the Tile address spaces.
The addresses of PCIe requests arriving from the MAC are translated to Tile physical addresses
using the map regions and IO TLB / IO MMU mechanisms described in Section 4.8 Ingress Mapping Regions.
5.4.1 Boot and Diagnostics Access
Access to TRIO’s boot and Rshim region (see Section 4.8.3 Boot and Rshim Regions) is typically
through the low 1MB of BAR0 (endpoint) or low 1MB of the PCIe address space (root). The
Rshim/boot region is relocatable at runtime.
In order to allow boot and debug access regardless of the BAR0 offset, incoming BAR0 accesses have their upper address bits masked. This causes the same (low) address bits to be passed to TRIO regardless of where system software locates the TILE-Gx BAR0 within the PCIe address space. The masking function is programmable via the TRIO_PCIE_INTFC_RX_BAR0_ADDR_MASK register.
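The effect of the masking can be shown with a one-line calculation. This sketch assumes a mask in which set bits are the address bits that are kept; the actual polarity and width of the mask register are not described here, and the function name and the 1MB mask in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Address passed on to TRIO after the upper BAR0 bits are masked off.
 * keep_mask is assumed to be a contiguous run of low bits. */
static uint64_t bar0_addr_seen_by_trio(uint64_t pcie_addr, uint64_t keep_mask)
{
    assert((keep_mask & (keep_mask + 1)) == 0);
    return pcie_addr & keep_mask;
}
```

With a mask of 0xFFFFF, an access at offset 0x1234 produces the same result whether system software places BAR0 at 0xFE000000 or at 0x2000000000, which is why boot and debug access does not depend on the BAR0 location.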
5.5 Interrupts
The PCIe interface supports legacy, MSI, and MSI-X interrupt mechanisms as well as interrupt
signaling through the MAP-MEM regions in TRIO.
As a PCIe endpoint, MSI/X and legacy interrupts may be dispatched to the root-complex via the
TRIO_PCIE_INTFC_EP_INT_GEN register.
As a PCIe root complex, legacy interrupts from devices are reflected as INT_LEVEL/INT_DEASSERT/INT_ASSERT interrupts in the TRIO_PCIE_INTFC_MAC_INT_STS interrupt status register.
MSI/X interrupts to a root complex port arrive as writes that may be mapped by system software
into the MAP-MEM interrupt registers. StreamIO may also use the MAP-MEM interrupt registers.
TRIO’s MAP-SQ doorbell interrupts may also be used for application-level interrupts.
5.6 Power Management
PCIe power management support is provided by hardware. Software may initiate certain power
management transitions via writes to the port’s PM D-State register and the TRIO_PCIE_INTFC_PM_INTFC_CTL register.
Active state power management (ASPM) L0s/L1 transitions are handled completely by the hardware as are the activities associated with D0-3 state transitions.
Transitions into and out of L2/L23 require software interaction via the XMT_TURNOFF,
XMT_PME, and LINK_RESTART bits in the TRIO_PCIE_INTFC_PM_INTFC_CTL register.
5.7 Link Down Handling
PCIe ports may go down due to surprise-remove, power management transitions, excessive link
errors, or system reset. When this occurs, transactions targeting the port must be flushed out of
the system to allow transactions targeting other ports to remain unaffected.
Read requests from TRIO Pull DMA are automatically “timed out” when their link goes down.
PIO requests will time out naturally. These timeouts will automatically restore the link’s tag space
and completion credits. Similarly, pending write requests from PIO and Push DMA as well as
ingress MAP completions will be flushed out of the port to prevent blocking of transactions to
other ports.
The link can be brought back up by software or can automatically retrain and be used as normal.
5.8 SERDES Configuration
The SERDES configuration (PLL settings, drive strength, de-emphasis, equalization etc.) is typically handled by the hardware automatically. However, if customized settings are needed, the
TRIO_PCIE_INTFC_SERDES_CONFIG register provides an access mechanism to SERDES-specific
settings.
The SERDES registers are not documented as part of the TILE-Gx IO Guide.
5.9 Streaming Interface
The TILE-Gx PCIe supports a SERDES-based streaming data interface for transport of bulk data to
and from external devices such as FPGAs.
When the streaming interface is used, it replaces the associated PCIe MAC with a lightweight
datalink layer which provides packetization, lane bonding, symbol encoding, error detection, and
flow control without the need for a complete PCI Express feature set (configuration registers,
etc.).
The interface supports from 1 to 4 lanes at rates up to 6.25Gbps per lane. An inband flow control
mechanism provides simple credit-based flow control of the datalink buffer.
The streaming interface supports all of the data movement models described previously
including:
• Push DMA: Moves data from Tile memory to remote device.
• Pull DMA: Moves data from remote device to Tile memory via “reads”.
• MAP MEM: Writes and reads from remote device mapped into Tile memory. Provides support for ordered, unordered (pipelined), and interrupt traffic.
• MAP SQ: Writes and reads from remote device mapped into Tile memory with Tile-side descriptor FIFO specifying the address and doorbell register for interrupts.
• Tile PIO: Loads and stores from Tile mapped to remote device writes and reads.
• Boot/Debug: Accesses to the Rshim from the remote device for boot and debug via a dedicated address space.
5.9.1 Packetization
The streaming data is packetized into fragments from 8 bytes to 1KB. This allows insertion of
credit and clock compensation frames at regular intervals and retains compatibility with the PCI
Express DMA infrastructure which has a 1KB max payload size.
Each packet contains a 64-bit address. This represents the I/O address for push and pull DMA
transactions. For streaming data sent from the remote component, the address represents the buffer location for the data.
To implement a ring buffer, the address would be incremented by 1024 on each packet and
wrapped back to zero at the ring boundary. Double buffer schemes would be similarly implemented by the remote device by keeping track of the buffer and offset being written.
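The ring-buffer addressing described above can be sketched as follows, from the point of view of the remote device that generates the packet addresses. The function name and the `ring_base` and `ring_bytes` parameters are hypothetical; the sketch assumes the ring size is a multiple of the 1KB fragment size.

```c
#include <assert.h>
#include <stdint.h>

#define STREAM_FRAGMENT_BYTES 1024u /* maximum fragment size from the text */

/* Address for the next packet in a ring of ring_bytes starting at ring_base. */
static uint64_t ring_next_addr(uint64_t ring_base, uint64_t ring_bytes,
                               uint64_t cur)
{
    assert(ring_bytes > 0 && ring_bytes % STREAM_FRAGMENT_BYTES == 0);
    assert(cur >= ring_base && cur < ring_base + ring_bytes);

    uint64_t off = cur - ring_base + STREAM_FRAGMENT_BYTES;
    if (off >= ring_bytes)
        off = 0; /* wrap at the ring boundary */
    return ring_base + off;
}
```

A double-buffer scheme would use the same idea with two bases, switching to the other buffer instead of wrapping to offset zero.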
5.9.2 Interrupts
Interrupts can be delivered from the remote device via the streaming interface by using one of the
MAP SQ doorbells or by using a MAP MEM region in MSI mode.
5.9.3 Flow Control
The streaming interface's link layer provides flow control for a small FIFO between the transaction and link layers. This flow control allows lossless transfer of data regardless of the resources
or clock rates provided on the remote device. The FIFO is typically sized just to cover the bandwidth delay product of credit updates accounting for the packet fragmentation size being used.
Typically this buffer will be between 1-4KB to maintain line rate of the interface.
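As a rough illustration of the sizing argument, the sketch below estimates a FIFO size as the bytes in flight during one credit round trip plus one fragment. The formula is an assumption about how the fragment size is accounted for, the function name is hypothetical, and the 500 ns round-trip figure in the usage note is invented for the example. The raw lane rate is used without any allowance for symbol-encoding overhead.

```c
#include <assert.h>
#include <stdint.h>

/* Estimated FIFO size in bytes for lossless operation at line rate. */
static uint64_t stream_fifo_bytes(uint64_t link_bits_per_sec,
                                  uint64_t credit_rtt_ns,
                                  uint64_t fragment_bytes)
{
    assert(credit_rtt_ns < UINT64_C(1000000000)); /* keeps the product small */
    /* Bytes that can arrive before a credit update takes effect. */
    uint64_t in_flight =
        link_bits_per_sec / 8 * credit_rtt_ns / UINT64_C(1000000000);
    return in_flight + fragment_bytes;
}
```

With 4 lanes at 6.25Gbps (25Gbps raw), a 500 ns credit round trip, and 1KB fragments, the estimate is 1562 + 1024 = 2586 bytes, which is consistent with the 1-4KB range quoted above.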
CHAPTER 6
mPIPE ARCHITECTURE
6.1 Overview
The TILE-Gx™ Multicore Programmable Intelligent Packet Engine (mPIPE™) provides line rate
services for the packet interfaces. These services are also available for packet data stored in on-chip buffers, such as flows being moved to and from a PCI interface.
Some example services supported by the mPIPE are:
• Parse: Identify the flow of each incoming packet
• Packet Distribution: Make ingress packet data available for worker processing
• Buffer Management: Track and distribute packet buffer resources
• Load Balance: Spread work across multiple Tiles
• Checksum: Calculate the L4 checksum on ingress and egress traffic
• Gather: Collect packet data from Tiles – potentially scattered across multiple buffers
• Egress: Send packet data to the wire
Functionality of the mPIPE is divided between ingress and egress, which are described in the following sections.
6.1.1 Glossary
The following terms are used throughout this chapter:
• MAC: A physical device connected to the mPIPE. Can include more than one channel.
• Port: The same as MAC.
• Channel: A set of physical resources in the mPIPE. Each MAC is associated with one or more channels. Depending on system configuration, more than one MAC might share a channel, for example when two MACs share a set of input pins and hence cannot be in service simultaneously.
• eDMA Ring: An egress descriptor ring. Each ring is associated with exactly one channel and hence exactly one MAC/Port.
• Priority Queue: A set of virtual resources in the mPIPE. Ingress packets are assigned to a priority queue by the MAC. Egress packets are queued based on the configuration of their associated eDMA ring.
• NotifRing: Data structure stored in Tile memory containing ingress packet descriptors created by the classifier.
• Classifier: Processor that parses incoming packets and determines to which flow the packet belongs.
• Load Balancer: Assigns incoming packets to NotifRings.
For more terminology refer to “Glossary, Conventions and Standards” on page 585.
6.1.2 PHY and DMA Sharing
The TILE-Gx™ processor shares its SERDES lanes between many different interfaces. The specific interfaces vary depending on the device within the TILE-Gx processor family.
Implementation Note: The TILE-Gx36™ device supports the following interfaces connected to the mPIPE:
• Four XAUI
• Sixteen Gigabit Ethernet (CDR-based SGMII)
Since these interfaces share SERDES lanes, not all configuration cross products are possible.
Similarly, a common mPIPE is shared between the interfaces described above. This sharing of
PHYs and packet processing services is shown in Figure 6-1.
[Figure: block diagram. From top to bottom: SERDES Lanes, PHY Distribution Layer, XAUI MACs / GbE MACs / Interlaken MACs, MAC Distribution Layer, Channelized iDMA and Channelized eDMA within the mPIPE, Tiles.]
Figure 6-1: PHY/DMA Sharing
6.1.3 Channelization
The mPIPE must manage traffic across multiple interfaces or multiple flows within the same interface (Interlaken). The mPIPE provides independent resources for each channel to support QoS
and non-blocking flows.
Implementation Note: The TILE-Gx36 device provides 20 channels mapped to up to 16 active MACs and
4 loopback channels.
6.1.4 Channels vs. Ports
There is a distinction between physical port and channel. A given physical port (such as SGMII
and XAUI) is statically assigned to a channel. In the case of multi-channel interfaces such as Interlaken, the physical port is statically assigned to multiple channels.
Note that a channel might be assigned to multiple ports. This can happen when two ports cannot
be physically active at the same time in a system. For example, they might share a PHY or other
physical resource. The MPIPE_MAC_MAP registers (MPIPE_MAC0_MAP, as an example) specify
which channels are associated with each port (MAC). And the MPIPE_MAC_ENABLE register
defines which ports are enabled.
On ingress, the classification and load balancing steps described in detail later in this document
define how packets arriving on a channel get distributed to various notification rings (workers).
For egress, the eDMA rings described in detail later in this document each have a configurable output channel; thus an eDMA ring is associated with a single output channel and a single output port.
6.1.5 Priority Queues
To support priority queuing standards such as 802.1Qbb, the mPIPE provides virtual queues for
both ingress and egress. Each MAC (port) determines how its traffic is mapped into the mPIPE’s
priority queues. Egress traffic is assigned to queues based on the PRIORITY_QUEUES bit mask in
the MPIPE_EDMA_RG_INIT_DAT_MAP register. Flow control for ingress traffic is provided on a
per-queue basis and each MAC responds to the flow control based on the MAC’s configuration
(for example pause frames, priority pause, or other inband or out of band flow control).
Implementation Note: The TILE-Gx36 device supports 16 priority queues.
6.1.6 Communication Model
The TILE-Gx processor memory system is optimized for efficient communication between Tiles,
cache controllers, memory controllers, and I/O devices. Bulk data transfer is done via the memory system. The mPIPE supports caching hints to optimize cache and memory-bandwidth
utilization based on the system’s locality attributes.
Interrupts are delivered through the Interprocessor Interrupt (IPI) mechanism.
6.2 Ingress Services
The TILE-Gx processor’s ingress hardware manages incoming packets from the I/O channels. The ingress mPIPE parses the packets, writes the packet data into software-visible buffers, then load balances across worker Tiles.
6.2.1 Typical Ingress Flow
Packets typically take the following steps through the ingress portion of the mPIPE. Subsequent
sections describe the mechanisms used to implement this flow.
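[Figure: ingress flow diagram. Blocks: MAC Distribution, Classification, iPkt Buffer, DMA CMDs, Load Balance, Descriptor Buffer, iDMA, Buffer Manager, Worker Notification. Numbered arrows 1, 2a, 2b, 2c, 3, 4 (Write Packet Data), 5 (Write Descriptor Data), and 6 (Notify Worker) correspond to the steps listed below.]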
Figure 6-2: Typical iDMA Flow
1. A packet is assembled by the PHY and MAC layers and presented to the channelized iDMA.
2. The packet is classified in order to identify the flow and choose a buffer pool. The classification steps generate a packet descriptor containing the following:
   a. Buffer Pool / DMA control.
   b. Bucket (typically computed from the flowID).
   c. Custom fields (for example: flow hash) passed to software.
3. The load balancer chooses a worker, based on the hashed FlowID.
4. Packet data is written to the chosen buffer pool, typically into the Tiles’ L3 cache.
5. A packet descriptor is written into a ring in memory space, typically local to the worker.
6. The worker is notified that a new packet descriptor is available via interrupt and/or update to the ring’s tail pointer.
6.2.2 Buffers
Packet data is stored in buffers. Each buffer has an associated descriptor that defines the buffer’s
attributes (virtual address, size, chaining).
The buffer descriptor format is shown in Figure 6-3.
[Figure: 64-bit buffer descriptor layout, from bit 63 down to bit 0: C [63:62], Size [61:59], HWB [58], Reserved [57:53], StackIDX [52:48], Reserved [47:42], VA[41:7], Offset [6:0].]
Figure 6-3: Buffer Descriptor Formats
Table 6-1. Buffer Descriptor Formats
Bits   Field     Description
63:62  C         Chaining Designation. Set by iDMA hardware and application SW prior to eDMA.
                   0: Unchained buffer pointer
                   1: Chained buffer pointer. Next descriptor stored in 1st 8-bytes in buffer.
                   3: Invalid Descriptor. Could not allocate buffer for this stack (iDMA) or end of chain (i/eDMA).
61:59  Size      Size of Buffer. Encoded as follows:
                   0: 128 bytes
                   1: 256 bytes
                   2: 512 bytes
                   3: 1024 bytes
                   4: 1664 bytes
                   5: 4096 bytes
                   6: 10368 bytes
                   7: 16384 bytes
58     HWB       Hardware Buffer. Indicates that this is a hardware-managed buffer. This bit will always be set for ingress packets. On egress, this bit indicates that the buffer should be returned to the hardware buffer manager.
57:53  Reserved  Reserved
52:48  StackIDX  Buffer stack to which this buffer belongs.
47:42  Reserved  Reserved
41:7   VA        Virtual address bits 41 to 7. Buffers are always aligned to 128-bytes, though packet data might be offset within the buffer.
6:0    Offset    Start byte of data within the 128-byte aligned buffer.
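The descriptor layout in Table 6-1 can be unpacked with shifts and masks. The sketch below is an illustration only: the field positions and the size encoding come from the table, while the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of a 64-bit buffer descriptor (Table 6-1). */
typedef struct {
    unsigned c;         /* chaining designation, bits 63:62 */
    unsigned size_code; /* encoded buffer size, bits 61:59 */
    unsigned hwb;       /* hardware-managed buffer, bit 58 */
    unsigned stack_idx; /* buffer stack index, bits 52:48 */
    uint64_t va;        /* 128-byte aligned buffer address, VA[41:7] */
    unsigned offset;    /* start byte of data within the buffer, bits 6:0 */
} buf_desc;

/* Buffer size in bytes for each 3-bit size code. */
static const unsigned buf_size_bytes[8] = {
    128, 256, 512, 1024, 1664, 4096, 10368, 16384
};

static buf_desc buf_desc_decode(uint64_t raw)
{
    buf_desc d;
    d.c         = (unsigned)((raw >> 62) & 0x3);
    d.size_code = (unsigned)((raw >> 59) & 0x7);
    d.hwb       = (unsigned)((raw >> 58) & 0x1);
    d.stack_idx = (unsigned)((raw >> 48) & 0x1F);
    d.va        = raw & (((UINT64_C(1) << 42) - 1) & ~UINT64_C(0x7F));
    d.offset    = (unsigned)(raw & 0x7F);
    assert(d.size_code < 8);
    return d;
}
```

The first byte of packet data is then at `va + offset`, and `buf_size_bytes[size_code]` gives the size of the buffer that contains it.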
The buffer descriptor provides a VA. The stack to which a buffer belongs provides an address
space identifier (ASID), which allows independent address spaces to coexist within the same system. See section “Virtual Memory” on page 133.
MiCA accelerator blocks, as described in “Common Accelerator Interface (MiCA)” on page 175,
can also read and write buffers that are written and read by mPIPE. However, note that
only mPIPE accesses the Buffer Stacks (see 6.2.2.1 Buffer Stacks).
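The field layout in Table 6-1 can be unpacked with plain shifts and masks. The following is a minimal sketch in C; the names bdesc_fields_t, bdesc_decode, and bdesc_size_bytes are hypothetical and are not taken from any Tilera header file.

```c
#include <stdint.h>

/* Hypothetical container for the fields of Table 6-1. */
typedef struct {
    unsigned chain;      /* C, bits 63:62 */
    unsigned size_code;  /* Size, bits 61:59 */
    unsigned hwb;        /* HWB, bit 58 */
    unsigned stack_idx;  /* StackIDX, bits 52:48 */
    uint64_t va;         /* 128-byte aligned address, from VA[41:7] */
    unsigned offset;     /* Offset, bits 6:0 */
} bdesc_fields_t;

/* Buffer size in bytes for each Size encoding listed in Table 6-1. */
static const unsigned bdesc_size_bytes[8] = {
    128, 256, 512, 1024, 1664, 4096, 10368, 16384
};

/* Split a raw 64-bit buffer descriptor into its fields. */
static bdesc_fields_t bdesc_decode(uint64_t raw)
{
    bdesc_fields_t f;
    f.chain     = (unsigned)((raw >> 62) & 0x3);
    f.size_code = (unsigned)((raw >> 59) & 0x7);
    f.hwb       = (unsigned)((raw >> 58) & 0x1);
    f.stack_idx = (unsigned)((raw >> 48) & 0x1f);
    /* Keep bits 41:7 in place so the result is already a byte address. */
    f.va        = raw & (((1ULL << 42) - 1) & ~0x7fULL);
    f.offset    = (unsigned)(raw & 0x7f);
    return f;
}
```

Under these assumptions, the first data byte of a buffer is at f.va + f.offset, and bdesc_size_bytes[f.size_code] gives the buffer size.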
6.2.2.1
Buffer Stacks
Buffer descriptors are stored in stacks that are managed by a buffer stack engine in the mPIPE.
Each time the iDMA engine requires a new buffer, it pops a descriptor from the associated stack
managed by the buffer stack engine. Each time the eDMA engine frees a buffer, the descriptor is
pushed back onto the stack by the buffer stack engine. Software can also return buffers to the
stack by sending a message to the buffer stack engine.
By storing buffer descriptors in stacks, the TILE-Gx processor optimizes temporal locality of packet
data. This locality also allows a portion of the stack to be cached by the buffer stack engine, thus
reducing the bandwidth required to read buffer descriptors from Tiles and memory.
The mPIPE supports multiple buffer stacks to allow size differentiation such as {small, medium,
large, jumbo}. Multiple stacks also allow QoS guarantees for different flows and/or applications
and support buffer regioning, such that a group of workers assigned to a given application can
have their buffers homed on a specific Tile or hashed-for-home within a set of Tiles.
Implementation Note: The TILE-Gx36 device supports 32 buffer stacks.
The buffer stack used for an incoming packet is chosen during classification. Each buffer stack
corresponds to an ASID. The TLB entry associated with a buffer’s VA and ASID provides the caching attributes including NonTemporal hint (NT-HINT), Pinning, and homing information. These
attributes include:
• Buffer Size: All buffers in a stack must be the same size.
• Background data policy: Specifies whether or not to fetch background data on partial cacheline
writes or to zero unused bytes.
• Write Miss Policy: Specifies whether or not to allocate in the cache when a write misses or send
to main memory.
• Write Hit Policy: Specifies whether or not to update the temporal hint (LRU) on a write hit.
• Read Inval Policy: Specifies whether or not to invalidate the cacheline after a read to save memory bandwidth (note that subsequent accesses will see unpredictable data).
6.2.2.2
Buffer Chaining
The TILE-Gx processor’s iDMA flow supports automatic scatter via buffer chaining. When a packet
exceeds the size of a single buffer, the packet is fragmented across multiple buffers and a link is
created from each buffer to the subsequent buffer. The links are simply buffer descriptors written
into the first 8 bytes of the buffer. The first 8 bytes of the final buffer in the chain are reserved/unpredictable. Each buffer descriptor contains both the buffer’s VA (virtual address) and its offset. While the VA points to the very beginning of the buffer, data will be written into the buffer
starting at the offset.
Figure 6-4 shows a buffer chaining example where:
• The buffer size is 128 bytes (120 bytes available for data).
• Three 128-byte buffers are required for a 300-byte packet.
• All buffers except for the first and last are full.
• All buffer descriptors except the first one have an offset of 8.
• The buffer descriptor in the final buffer is marked INVALID. See the exception for cut-through,
as described in 6.2.2.2 Buffer Chaining and 6.2.2.3 Buffer Release.
Note that the buffer descriptor shown in Figure 6-4 is associated with the iDesc; however, this is not
the only place a buffer descriptor can be used. Buffer descriptors can also be part of eDMA Packet
Descriptors (refer to Figure 6-17 “eDMA Descriptor Format” on page 124) or can even be used
standalone (refer to Figure 11-5 “Using a List of Buffer Descriptors as a MiCA Destination Mode”
on page 183).
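[Figure: the iDesc (L2_size=300) holds a buffer descriptor (Size=0 (128B), C=1, Offset=20) that
points to Buffer-0 (108 valid bytes). Buffer-0 begins with a buffer descriptor (C=1, Offset=8) that
points to Buffer-1 (120 valid bytes). Buffer-1 begins with a buffer descriptor (C=1, Offset=8) that
points to Buffer-2 (72 valid bytes). Buffer-2 begins with a buffer descriptor marked C=3 (INVALID).]
Figure 6-4: Buffer Chaining Example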
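The byte counts in the chaining example can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: chain_layout is a hypothetical helper, and it assumes a store-and-forward packet whose first-buffer offset already includes the 8-byte chain field.

```c
#include <stddef.h>

/* Hypothetical helper: distributes l2_size packet bytes over chained buffers
 * and records the number of valid bytes per buffer in valid[].
 * Returns the number of buffers used, or 0 if max_bufs is too small. */
static size_t chain_layout(unsigned l2_size, unsigned buffer_size,
                           unsigned first_offset,
                           unsigned valid[], size_t max_bufs)
{
    size_t n = 0;
    unsigned offset = first_offset;  /* classifier offset applies to buffer 0 only */

    while (l2_size > 0) {
        if (n == max_bufs)
            return 0;
        unsigned room = buffer_size - offset;
        unsigned take = (l2_size < room) ? l2_size : room;
        valid[n++] = take;
        l2_size -= take;
        offset = 8;                  /* later buffers start after the chain field */
    }
    return n;
}
```

Calling chain_layout(300, 128, 20, valid, 4) yields three buffers holding 108, 120, and 72 bytes, matching the example.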
Packets that cut through (as indicated by the CT bit in the packet descriptor) will always use
chained buffers, unless the buffer size is set to 7 (16384 bytes). In this case, the packet will always
fit into the buffer or will be truncated by the hardware.
For packets that have been designated to be handled by cut-through methods, the buffer descriptor’s chain-valid indicator cannot be used to determine the end of the buffer chain. Unlike store-and-forward methods, cut-through handling of packets does not need to write to the final buffer
(indicating the end of the packet transmission). Instead, software must calculate the number of
buffers that were used and follow the chain appropriately.
When buffers are chained, the 7-bit offset field in the buffer descriptor includes the 8-byte buffer
chain field. So, for example, if the data starts at byte 23 of the buffer with the chain in bytes 0-7,
the offset field would be 23. The following rules summarize the relationship between buffer
chaining and the buffer offset:
• The offset field in the buffer descriptor is in bits[6:0], so the maximum offset is 127.
• If the chain field is BDESC_CHAINED, the minimum offset is 8.
• The classifier does not know if the buffer will be chained, so the offset it picks can have 8 added
to it when it gets to the iDMA engine.
• The classifier must never choose an offset greater than 119 if the buffer could possibly be
chained.
• The iDMA engine will apply the classifier-specified offset setting only to the first buffer in the
chain. Hence all bdescs (buffer descriptors) in the buffer chain, other than the one in the
pDesc, will have an offset of 8.
• The eDMA engine expects all bdescs with a chain field of BDESC_CHAINED to have an offset
of at least 8.
• The eDMA engine will properly handle descriptors “mid chain” when the buffers are full of
data (except for the header), that is, when their offset fields are equal to 8.
6.2.2.3
Buffer Release
In many systems, the egress hardware can be used to release chained buffers back to the buffer
stack manager. In these systems, software simply posts the eDMA packet. The hardware will collect and free the buffers. If software decides to discard the packet, the buffers can still be collected
and freed by the egress hardware using the NoSend flow described in “NoSend Option” on page
133.
Note that for chained packets that have cut-through, the final buffer descriptor stored in the last
buffer in the chain will need to be released explicitly by software since the egress buffer collection
hardware will not free this buffer. For more information see “Transaction Sizing and Buffer Offsets” on page 129.
Software Buffer Release
In systems where a hardware buffer release is not desirable or possible, software is required to
maintain and/or release the buffers.
A generic method for software to determine which buffer descriptors need to be freed from an
ingress packet is shown in Listing 6-1:
Listing 6-1. Example Buffer-Release Algorithm
total_bytes = pkt_descriptor.L2_SIZE +    // packet bytes
              pkt_descriptor.OFFSET -     // offset (zero pad)
              8                           // 8 bytes of offset include the chain pointer
if (!pkt_descriptor.chained)
    num_descriptors = 1
else {
    // number of descriptors required, accounting for
    // chain pointers
    num_descriptors = roundup(total_bytes/(buffer_size-8))
    // extra descriptor for cut-through packets
    // Largest buffer size never chains
    // on buffer error, there won't be an extra buff desc
    if (pkt_descriptor.CT &&
        (buffer_size != 16384) &&
        !pkt_descriptor.BE)
        num_descriptors++
}
// get 1st buffer descriptor from packet descriptor
buf_desc = pkt_descriptor.BUFF_DESC
while (num_descriptors--) {
    // the next descriptor is stored in the first 8 bytes of the buffer
    next_buf_desc = *(buf_desc.VA & ~0x7f)
    free_buf(buf_desc)
    buf_desc = next_buf_desc
}
Alternatively, in systems where packets might cut through, software could ensure that the first 8
bytes of all buffers are always written with an invalid buffer descriptor prior to freeing any buffers. This would allow software simply to follow the buffer chain linked list and free all valid
buffer descriptors, stopping once it reaches an invalid descriptor.
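The descriptor count from Listing 6-1 can be written as a small C function. This is a sketch under stated assumptions: pkt_info_t and descriptors_to_free are hypothetical names, and the byte total subtracts the 8-byte chain field from the offset, which is how the comment in the listing reads.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the packet descriptor used by Listing 6-1. */
typedef struct {
    uint32_t l2_size;  /* L2_Size */
    uint32_t offset;   /* Offset of the first buffer (includes the chain field) */
    bool chained;      /* first buffer descriptor has C == 1 */
    bool ct;           /* CT: packet cut through */
    bool be;           /* BE: buffer error */
} pkt_info_t;

/* Number of buffer descriptors software must free for one ingress packet. */
static uint32_t descriptors_to_free(const pkt_info_t *p, uint32_t buffer_size)
{
    if (!p->chained)
        return 1;

    uint32_t usable = buffer_size - 8;             /* data bytes per chained buffer */
    uint32_t total  = p->l2_size + p->offset - 8;  /* data plus leading pad */
    uint32_t n      = (total + usable - 1) / usable;   /* round up */

    /* Cut-through packets carry one extra descriptor, except with the
     * largest buffer size (never chained) or after a buffer error. */
    if (p->ct && buffer_size != 16384 && !p->be)
        n++;
    return n;
}
```

For the 300-byte packet of Figure 6-4 (offset 20, 128-byte buffers), this returns 3 for store-and-forward and 4 when CT is set.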
6.2.2.4
Buffer Stack Engine
The buffer stacks are managed in hardware using the Buffer Stack Engine. System software
assigns a physical address range for each stack. These ranges must be aligned to 64KB regions in
physical address space. The stacks are configured using the MPIPE_BSM_INIT_CTL and
MPIPE_BSM_INIT_DAT registers.
The buffer stack engine maintains a prefetch buffer for the top of the stack allowing fast access by
the iDMA engine and reducing Tile memory system bandwidth when a steady state of eDMA
buffer returns is able to feed directly into the iDMA engine.
Implementation Note: The TILE-Gx36 device prefetches up to 64 descriptors for each stack.
Typically, buffers are consumed by iDMA hardware and released by eDMA hardware. However,
software can manually post buffers via an MMIO write to the buffer stack engine. Software can
also consume buffers by directing an MMIO read to the stack engine. MMIO reads to a buffer
stack will only return a valid buffer indication if the stack’s prefetch buffer has descriptors
available.
If there are descriptors in the stack, but the prefetch buffer is temporarily empty, the MMIO read
will return a descriptor with the chain-mode set to BDESC_NOT_RDY (2). Software can re-read
until it gets a valid descriptor. Generally this condition will only occur if the low-water mark in
the MPIPE_BSM_CTL register is set below the recommended value.
If there are no descriptors in the stack, an MMIO read will return a descriptor with the chain-mode
set to BDESC_INVALID (3).
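The re-read behavior can be sketched as a bounded retry loop. Everything below is hypothetical scaffolding: bsm_pop_fn stands in for the MMIO read (whose actual address and access routine are not shown here), fake_pop is a test double, and the code assumes the chain-mode sits in bits 63:62 of the returned descriptor as in Table 6-1.

```c
#include <stdint.h>

enum { BDESC_NOT_RDY = 2, BDESC_INVALID = 3 };  /* chain-mode values from the text */

/* Hypothetical stand-in for an MMIO read of the buffer stack engine. */
typedef uint64_t (*bsm_pop_fn)(void *ctx);

/* Returns 1 and stores the descriptor in *out on success, 0 if the stack
 * is empty, or -1 if the prefetch buffer stayed empty for max_tries reads. */
static int bsm_pop_retry(bsm_pop_fn pop, void *ctx, unsigned max_tries,
                         uint64_t *out)
{
    for (unsigned i = 0; i < max_tries; i++) {
        uint64_t d = pop(ctx);
        unsigned chain = (unsigned)(d >> 62);
        if (chain == BDESC_INVALID)
            return 0;
        if (chain != BDESC_NOT_RDY) {
            *out = d;
            return 1;
        }
    }
    return -1;
}

/* Test double: reports "not ready" until the counter in ctx reaches zero. */
static uint64_t fake_pop(void *ctx)
{
    unsigned *not_ready_left = ctx;
    if (*not_ready_left > 0) {
        (*not_ready_left)--;
        return (uint64_t)BDESC_NOT_RDY << 62;
    }
    return 0x1000;  /* unchained descriptor with VA 0x1000 */
}
```

Bounding the retries is a design choice for the sketch; it keeps a misconfigured low-water mark from turning into an unbounded spin.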
Hardware spills/fills descriptors to/from the Tile-memory based stack, thus the format of these
descriptors is typically not relevant to software. However, software can choose to “preload” the
memory-based buffer stacks or otherwise interpret the data. The format for the hardware-managed buffer stacks is provided in Table 6-2.
Table 6-2. Hardware-Managed Buffer Stacks
[Table layout not recoverable from this copy. The table shows one 64-byte block as eight rows at
offsets 0x00 through 0x38, with columns Byte7 down to Byte0, holding the packed entries Desc-0
through Desc-11.]
Each Desc is a compressed buffer descriptor with the following format:
Bits 39:35   Reserved
Bits 34:0    Buffer VA[41:7]
Figure 6-5: Buffer Stack Manager Memory Format
The format is repeated for each 64-byte block (12 descriptors), as described above. Descriptors are
stored in blocks of 12, so software must take care to add or remove blocks of 12 if it is
manipulating the Tile-memory based stacks directly. Typically, software will add and remove
buffers via the MMIO buffer post/fetch interface; hence the format and blocking of descriptors is
not relevant.
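For software that does inspect the memory-based stack, converting between a compressed entry and a buffer address is a shift and a mask. The sketch below reads Figure 6-5 as a 40-bit entry with VA[41:7] in bits 34:0 and bits 39:35 reserved; the function names are hypothetical.

```c
#include <stdint.h>

/* Extract the 128-byte aligned buffer address from a compressed entry. */
static uint64_t stack_desc_to_va(uint64_t desc40)
{
    return (desc40 & ((1ULL << 35) - 1)) << 7;
}

/* Build a compressed entry from a buffer address (reserved bits left at zero). */
static uint64_t va_to_stack_desc(uint64_t va)
{
    return (va >> 7) & ((1ULL << 35) - 1);
}
```

Each entry occupies 5 bytes, so the 12 entries of a block account for 60 of its 64 bytes.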
6.2.3
iDMA Packet Descriptors
Each packet traversing the iDMA flow is assigned a packet descriptor. The packet descriptor is a
64-byte summary of the classification, load balancing, and buffer management aspects of ingress
processing.
The packet descriptor is delivered to the worker via memory space writes to notification rings.
The packet descriptor also controls DMA processing and load balancing. Its format is shown in
Figure 6-6. Note that this format is described in the mPIPE I/O descriptor header file.
[Figure: iDMA Packet Descriptor (entry in NotifRing). The 64-byte descriptor is drawn as sixteen
32-bit words at offsets 0x00 through 0x3C, with Byte 3 on the left and Byte 0 on the right.
Fields shown: Channel, CS, NR, TS, SQ, PS, BE, Dest, ME, TR, CT, CE, L2_Size, CTR0,
CSUM_SEED/VAL, NotifRingIDX, BucketID, CSUM_START, CTR1, GP_SQN / GP_SQN_SEL, PacketSQN,
TimeStamp, and the buffer descriptor fields C, Size, VA[31:7], StackIDX, Offset, VA[41:32].
Legend: Filled by HW; Must be filled by classifier; HW or Classifier; Reserved; General Use
(custom data from classifier to SW); Custom or HW; Buffer Descriptor.]
Figure 6-6: iDMA Packet Descriptor
Table 6-3. iDMA Packet Descriptor Formats

Field          Description
Channel        Source Channel.
PS             Enable PacketSQN Insertion.
               When 1, the packet sequence number (PacketSQN) will be inserted.
               When 0, the PacketSQN field can be filled with custom data from the classifier.
PacketSQN      Packet Sequence Number. Assigned at notification time.
TimeStamp      Timestamp assigned at arrival from MAC.
CS             Checksum Generation Enabled.
TS             Enable TimeStamp Insertion.
               When 0, the timestamp field can be used for custom data by the classifier.
               When 1, hardware inserts the timestamp that was captured when the start of
               packet was received from the MAC.
CSUM_SEED/VAL  Initial seed for checksum (from classifier), later filled with CSUM result by HW.
CSUM_Start     Start Byte for Checksum.
L2_Size        Final L2 size of packet. Written at notification time. Does not include preamble
               or CRC unless those fields are enabled for pass-through from the MAC.
ME             MAC Error. Generated by the MAC Interface. Asserted if there was an overrun of the
               MAC's receive FIFO. This condition generally only occurs if the mPIPE clock is
               running too slowly.
CE             L2 CRC Error. Generated by the MAC. Asserted if the MAC indicated an L2 CRC error or
               other L2 error (bad length, etc.) on the packet.
CT             Cut-through. When asserted, the packet was not completely received before being
               passed to the classifier. The L2_Size field indicates the number of bytes received
               so far.
TR             Truncate. The packet was truncated due to out-of-space in the iPkt buffer.
BE             Buffer Error. Indicates that the hardware ran out of buffers for this stack. SW must
               still free any buffers in the chain.
BucketID       BucketID filled by classifier.
NR             NotifRingIDX is determined by the classifier instead of the load balancer.
StackIDX       Buffer stack to use for this packet.
NotifRingIDX   NotifRing to write pDesc into. Typically filled by the load balancer, but can be
               overridden by the classifier with the NR bit.
GP_SQN         Sequence number applied when the packet is distributed. The classifier selects which
               sequence number is to be applied by writing the 13-bit SQN-selector into this field.
SQ             When asserted, the GP_SQN_SEL field contains the sequence number selector and the
               GP_SQN field will be replaced with the associated sequence number. When clear, the
               GP_SQN field is left intact and can be used as “Custom” bytes.
Size, VA, C    Filled by the stack manager based on the buffer chosen from StackIDX.
Offset         Start offset within the buffer for the packet data.
Dest           0  Drop packet, drop pDesc.
               1  Write packet to buffer(s) from BufferIDX. Descriptor sent to load balancer for
                  notification.
               2  Write pDesc, drop packet (used for recirculated packets where packet data is
                  already in Tile buffers).
               3  RSVD
CTR0, CTR1     Encoded Counter Selects. The associated counters are incremented when the packet is
               sent.
6.2.4
Notification Rings
Packet descriptors are written into rings stored in memory space. These rings represent work
queues for individual Tiles. There are more rings than worker Tiles to allow a given worker to
have multiple work queues. This makes it possible, for example, to assign high-priority packets to their own ring.
Implementation Note: The TILE-Gx36 device supports 256 Notification Rings.
Each notification ring is associated with a specific worker Tile. The NotifRingTable in the
mPIPE stores the memory location of the ring, the TileID of the Tile assigned to receive notification messages (if enabled), the ring size (126, 510, 2046, or 65534 descriptors), and the current ring
count. This table is accessible via the MPIPE_LBL_INIT_CTL/MPIPE_LBL_INIT_DAT registers.
Each ring holds two fewer descriptors than the associated memory footprint. One entry holds
the tail-pointer data, and one entry is left unused so that software can distinguish empty versus
full by comparing head and tail values. If software requires more than 65534 packets to be
enqueued for a given ring, it must copy the 64-byte packet descriptors to a software-managed
ring.
The mPIPE writes packet descriptors into the notification ring assigned by the load balancer. After
the descriptor is written into the ring, the tail pointer is updated both in the NotifRingTable
and in the Tail field stored in the Notification Ring itself. The Tail field is updated with either
an 8-byte or 64-byte write operation as specified by the TUP_PTC configuration (tail pointer
update) bit of the MPIPE_NTF_CTL register. In all cases, the first 64-bytes of the ring are NOT used
for packet descriptors thus the tail pointer will never be zero.
Software must initialize its head pointer to 1. Each time software processes a descriptor, it should increment the head pointer by 1. Once the head pointer passes the end of the ring, it must be set back to 1. For example, in a size=2048 ring
(up to 2046 descriptors), the increment could be done by:
head = head+1;
head = (head & 0x7ff) + (head >> 11);
The worker Tile can poll on the tail pointer stored in front of the NotifRing to see when a new
packet becomes available (that is, when head != tail). Optionally, an interrupt can be sent to
the worker when the tail has been updated. The tail pointer stored in the NotificationRing
represents the next ring location to be written. Hence, the first descriptor written into the ring will
result in the Tail value being set to 2, indicating that the descriptor in location 1 is valid
(location 0 contains the tail pointer and padding to the next 64 bytes).
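The head-pointer update generalizes to the other ring sizes by replacing the constants with the ring's log2 footprint. A minimal sketch follows; notif_ring_next and notif_ring_has_work are hypothetical names, and the footprints of 128, 512, 2048, and 65536 entries are inferred from the listed capacities plus the two reserved entries.

```c
#include <stdint.h>

/* Advance a NotifRing head pointer.
 * log2_entries is 7, 9, 11, or 16 for footprints of 128, 512, 2048,
 * or 65536 entries. Entry 0 holds the tail pointer, so the head wraps
 * from the last entry back to 1, never to 0. */
static uint32_t notif_ring_next(uint32_t head, unsigned log2_entries)
{
    uint32_t mask = (1u << log2_entries) - 1;
    head += 1;
    return (head & mask) + (head >> log2_entries);
}

/* A descriptor is pending whenever the head and tail pointers differ. */
static int notif_ring_has_work(uint32_t head, uint32_t tail)
{
    return head != tail;
}
```

With log2_entries = 11 this reduces to the two-line update shown for the size=2048 ring.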
[Figure: multiple NotifRings stored in memory space. Each ring entry is a 64-byte packet descriptor
(PDE) containing an 8-byte buffer descriptor that points to the packet data. The Tail pointer is
stored in the ring and also in the NotifRing Table. The Head pointer is updated by a message from
the worker and stored in the NotifRing Table.]
Figure 6-7: Notification Ring Data Structure Format
6.2.5
Store-and-Forward vs. Cut-Through
Ingress store-and-forward operations allow the packet distribution hardware to make buffer and
load balance decisions after the complete packet has been received. This provides, for example,
the true L2 size and CRC validation to the classification stage. For high bandwidth flows, the
increase in latency due to store-and-forward is generally not significant.
However, in a multi-channel configuration, the additional storage and data-copy required for
large frames can make store-and-forward impractical due to area, power, and bandwidth constraints. Implementations might require the use of cut-through operations in certain
configurations.
Implementation Note: The TILE-Gx36 device provides 192KB of ingress buffering for store-and-forward
flows. This allows store-and-forward for up to 4 channels if jumbo frames are supported on the
interfaces. It also allows store-and-forward for up to 24 channels of non jumbo frames.
In configurations where cut-through must be used, the interface will still store-and-forward
frames up to the threshold in the CUTTHROUGH field of the MPIPE_IPKT_THRESH register. This
allows, for example, the classifier to know the exact L2 size for packets up to 1600 bytes. Beyond
this point, the classifier will only know that it is “larger than 1600”.
6.2.6
Classifier
In order to provide flexible parsing and compatibility with future protocols, the mPIPE uses a
programmable classifier. The classifier’s job is to generate a packet descriptor based on the incoming packet headers. This packet descriptor identifies the flow for load balancing and ordering,
controls where the data is written, and identifies exception flows. The classifier parses all of the
L2 and some or all of the L3 headers in order to identify, for example, the IP source and destination and determine the L4 octet offset.
6.2.6.1
Parallel Processing
The classification step requires significant processing of the packet header. In order to provide a
flexible and scalable solution, parallel classification engines are used. Since the classification step
is stateless, there is no communication required between the engines and the performance scales
linearly with the addition of more classification engines.
Hardware in the mPIPE ensures that the packet order is maintained through classification, hence
the implementation of many parallel classifiers appears to the user as a single high-speed classification processor.
Implementation Note: The TILE-Gx36 device contains 10 classifiers, each running at up to 1.6GHz.
This provides up to 268 cycles to process each packet at the maximum arrival rate of 60 million
packets per second (40Gbps at minimum Ethernet packet size). For more information, refer to the
TILE-Gx36 Preliminary Data Sheet (DS400).
6.2.6.2
Cycle Budget
Each packet has a classification cycle budget based on the wire time of the packet. The classifier’s
distribution and reorder architecture allows longer packets to take extra time for classification.
The BUDGET settings in the MPIPE_CLS_CTL register are used to determine the budget for each
packet according to the following formula:
CycleBudget =
    (min(max(L2_SIZE, BUDGET_MIN) + BUDGET_OVHD, 255) *
     BUDGET_MULT / 128) +
    BUDGET_ADJ
BUDGET_MULT is typically calculated at system configuration time based on the following
formula:
BUDGET_MULT = 128 * num_classifiers * freq / line_rate
Where line_rate is in megabytes per second and freq is the frequency of the classifier in MHz.
The BUDGET_ADJ setting compensates for integer truncation during the BUDGET_MULT and
CycleBudget calculations and also for the budget-expired exception time (3 cycles). If line rate is
required even when budget exceptions occur, BUDGET_ADJ must be -3 or smaller.
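The two formulas can be checked numerically with integer arithmetic. The sketch below reads the CycleBudget formula as min(max(L2_SIZE, BUDGET_MIN) + BUDGET_OVHD, 255) scaled by BUDGET_MULT/128, with truncating division; the function names are hypothetical, and the sample values BUDGET_MIN = 64 and BUDGET_OVHD = 20 used in the note below are assumptions chosen for illustration, not documented defaults.

```c
#include <stdint.h>

/* BUDGET_MULT = 128 * num_classifiers * freq / line_rate,
 * with freq in MHz and line_rate in megabytes per second. */
static uint32_t budget_mult(uint32_t num_classifiers, uint32_t freq_mhz,
                            uint32_t line_rate_mbytes_per_s)
{
    return (128u * num_classifiers * freq_mhz) / line_rate_mbytes_per_s;
}

/* Cycle budget for one packet, using truncating integer division. */
static int32_t cycle_budget(uint32_t l2_size, uint32_t budget_min,
                            uint32_t budget_ovhd, uint32_t mult,
                            int32_t budget_adj)
{
    uint32_t bytes = (l2_size > budget_min ? l2_size : budget_min) + budget_ovhd;
    if (bytes > 255)
        bytes = 255;  /* the formula caps the byte count at 255 */
    return (int32_t)(bytes * mult / 128u) + budget_adj;
}
```

For 10 classifiers at 1600 MHz and a 5000 MB/s (40 Gbps) line rate, budget_mult gives 409. A 64-byte packet with the assumed settings then gets a budget of 268 cycles before BUDGET_ADJ is applied, in line with the per-packet figure quoted for the TILE-Gx36.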
Once a header has exceeded its cycle budget, the classifier can terminate processing on the header
and jump back to PC zero (program counter = 0) in order to start working on the next packet. When this
occurs, the packet descriptor will be assigned a fixed DEST, NotifRing, and BufferStack
based on the configuration in the MPIPE_CLS_CTL register. This can be used to debug the classification program and is not generally intended for use in normal operation.
The setting of the CLSQ_HWM field in the MPIPE_IPKT_THRESH register determines when the cycle
budget will be monitored. This allows some margin for occasionally exceeding the budget as long
as the classification queue is not filling up.
Using the cycle budget to drop the current header allows the classifier system to drop a packet
that is taking too long, rather than dropping unrelated packets that have accumulated behind it.
For example, a management flow can be guaranteed to be handled by the classification program
and will never be victimized by a flow that is causing the classifier to exceed its cycle budget.
The cycles remaining in the budget and state of the high water mark are readable by the classifier
program as an SPR in CLASSIFIER_BUDGET to allow programs to make dynamic decisions based
on time remaining.
6.2.7
Processor Architecture
The classifier, illustrated in Figure 6-8, is a compact 16-bit RISC processor with several special
instructions and datapaths to accelerate packet header parsing. Due to its simplified design, the
classifier has very few data-induced stalls. As a result, the performance is easy to predict and analyze. Once the worst case path through the instruction flow is determined, the cycle count can be
calculated or simulated to determine the maximum packet rate that can be supported.
[Figure: classifier block diagram. Recoverable labels: Instruction Memory, fPC, Decode, immd (srcB
only), iHdr (header bytes from iPkt, 16 bits, 2 ports), hPtr, Bypass, Tbl (4KB), tPtr, Regfile
(23 x 16bit), HSH accums, srcA, srcB, SPR, ALU, HSH, pdPtr, pDesc (64 bytes). PC update sources:
CBR Mispredict, pred Taken, Exception, jr Instruction.]
Figure 6-8: Classifier Architecture (see note 1)
Headers from incoming packets are written into the 256-byte iHdr buffer. This buffer
is directly addressable from the program to provide zero-latency access to header bytes. The output from the processor is a packet descriptor containing the buffer pool index, DMA control, and
custom data. The processor utilizes a 16-bit datapath for maximum efficiency in L2/3 processing.
1. The blocks with plus signs (+) indicate incrementers.
In addition to typical ALU operations, the processor supports several special purpose functional
units.
6.2.7.1
Header and Descriptor
The classifier interacts with the rest of the mPIPE ingress hardware via the packet header and
packet descriptor buffers. Instructions can directly consume bytes from the header and directly
write bytes to the descriptor by using dedicated register specifiers.
The header buffer is read-only and contains the first 256 bytes of the packet. The buffer is
addressed with the hPtr and the bytes can be consumed by any instruction. The hPtr can optionally be post incremented whenever bytes are consumed by the program. The program decides
which bytes of the header to use for classification. The pointer can also be explicitly written in
order to provide random access to the header buffer.
The descriptor buffer is a write-only 64-byte structure addressed by the pdPtr. The program can
write 1 or 2 bytes to the structure with any operation. Similar to the hPtr, the pdPtr can be
explicitly written to provide random access. pdPtr is not readable from the classifier program.
Both the hPtr and pdPtr are set back to their initial values whenever the PC transitions to zero
due to a branch, jump, or exception.
6.2.7.2
Table Lookup
The 4096 byte read-only memory provides lookup capabilities for MAC address matching, VLAN
information, or policy decisions. The table is written by Tile software during initialization. The
classifier program accesses this table via an indirect mechanism much like iHdr. The tPtr register (R24/mempos) can be written by any instruction and provides a byte address into the table.
Instructions that read R24/mem2 will retrieve two bytes at the 2-byte aligned tPtr address and
tPtr will be incremented by 2 (LSB of address is ignored). Instructions that read R23/mem1 will
retrieve one byte from the table and tPtr will be incremented by 1. tPtr is not readable from the
classifier program.
6.2.7.3
Special Registers
The classifier uses the instruction’s register specifiers to encode special access to the iHdr buffer
and pointer, the pDesc buffer and pointer, and the table and table-pointer. The register encodings
are defined in Table 6-4.
Table 6-4. Classifier Register Specifier Encodings

Register   Behavior as Source              Behavior as Destination
R0-R21     GPRs                            GPRs
R22        Read hPtr                       Write hPtr
R23        Read table[tPtr++(1)] (mem1)    Write pdPtr
R24        Read table[tPtr++(2)] (mem2)    Write tPtr
R25        Read iHdr[hPtr] (peek2)         Write pDesc[pdPtr++(1)] (put1)
R26        Read iHdr[hPtr++(2)] (get2)     Write pDesc[pdPtr++(2)] (put2)
R27        HASH0_LO
R28        HASH0_HI
R29        HASH1_LO
R30        HASH1_HI
R31        Zero                            Null
6.2.7.4
Hash Accumulator
Two 32-bit hash accumulators provide symmetric hashing of the flowID/Tuple. Either one of the
hash accumulators can be run in parallel with a checksum operation on the same source bytes.
Each hash accumulator is a pair of 16-bit GPRs. The hash instructions always accumulate into the
same pair of GPRs thus an accumulator destination operand is not needed. Both an 8 and a 16-bit
accumulate instruction are supported.
The hash function is a CRC using the same 2^32 polynomial as the crc32 instruction provided in
the Tile. This is the same polynomial used for Ethernet CRC.
The instruction set for the classification processor is defined in Appendix B: Classifier Instructions
and SPRs.
6.2.7.5
Endianness
When reading bytes from the iHdr or the Table, the bytes are interpreted as being big endian. The
byte pointed to by hPtr/tPtr feeds bits[15:8] of the source operand and the byte pointed to by
(hPtr+1) feeds bits[7:0] of the source operand.
This allows multi-byte fields such as EtherType/Len to be properly interpreted by the classifier.
When writing two bytes into the pDesc, bits[7:0] are written into the byte pointed to by pdPtr.
Bits[15:8] are written into the byte pointed to by (pdPtr+1).
Hence, to copy bytes from the iHdr buffer to the pDesc buffer, the program needs to swap the
bytes using an instruction or sequence similar to the one below.
// copy two bytes from iHdr to pDesc maintaining byte order
rotli put2, get2, 8;
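The need for the rotate can be seen by modelling the two access paths in C. This is an illustrative model only; ihdr_get2, pdesc_put2, and rotl16_by_8 are hypothetical names that mimic the get2 read, the put2 write, and the 8-bit rotate on a 16-bit value.

```c
#include <stdint.h>

/* Model of get2: the byte at hPtr feeds bits[15:8], the next byte bits[7:0]. */
static uint16_t ihdr_get2(const uint8_t *ihdr, unsigned hptr)
{
    return (uint16_t)((ihdr[hptr] << 8) | ihdr[hptr + 1]);
}

/* Model of put2: bits[7:0] go to the byte at pdPtr, bits[15:8] to pdPtr+1. */
static void pdesc_put2(uint8_t *pdesc, unsigned pdptr, uint16_t value)
{
    pdesc[pdptr]     = (uint8_t)(value & 0xff);
    pdesc[pdptr + 1] = (uint8_t)(value >> 8);
}

/* Rotate a 16-bit value left by 8, which swaps its two bytes. */
static uint16_t rotl16_by_8(uint16_t value)
{
    return (uint16_t)((value << 8) | (value >> 8));
}
```

Copying the header bytes 0x86, 0xDD through pdesc_put2(pd, 0, rotl16_by_8(ihdr_get2(hdr, 0))) leaves 0x86, 0xDD in the descriptor; without the rotate the two bytes would arrive reversed.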
The SEED field in the pDesc is interpreted in network order by the iDMA checksum calculator.
This means that bits[15:8] correspond to the CSUM_START byte of the packet and bits[7:0] correspond to the CSUM_START+1 byte of the packet. The resulting checksum is placed into the pDesc
with the byte corresponding to earlier checksummed packet bytes in bits[7:0] and later
checksummed bytes in bits[15:8].
6.2.7.6
Header/Descriptor Valid Indicators
Hardware in the ingress mPIPE writes the header data into the iHdr buffer then sets an internal
valid bit indicating that the classification program can consume the packet header. When the PC is
set to 0 due to a branch, jump, or exception to PC zero, the classifier automatically clears its internal header-valid indicator.
Similarly, hardware reads the packet descriptor when the PC is set to 0 due to a branch or jump
instruction or an exception directed to PC zero.
A double buffering scheme is used on both the iHdr and pDesc structures to allow pipelined
operation of the classifier processor array.
All classifier programs MUST have at least one iHdr read or an MFSPR from one of HEADER_FLAGS, CHANNEL, or L2_SIZE. Without accessing one of these structures, the program will not
properly interlock with incoming packet headers and iPkt buffer corruption will occur.
6.2.7.7
Classifier Pipeline
The classifier’s pipeline is bypassed to provide single-cycle latency for most operations. Branches
and jumps incur extra latency to update the PC. This is visible to the programmer as additional
cycles to execute the program; there are no delay slots.
6.2.7.8 Stalls
When the program reads the iHdr buffer and data is not valid, the consuming instruction will
stall until data becomes valid. When stalled, the processor is in a low power state. Hence the program simply needs to read the iHdr buffer to automatically wait for the next packet header to
arrive.
When the hPtr is explicitly written by an instruction, the iHdr buffer becomes invalid for 3-4
cycles while the iHdr buffer is read at the new location. The program can execute any instructions
within these 3-4 cycles; however, if an instruction in this window reads the iHdr buffer, it will be
stalled until valid iHdr data is provided.
The classifier also stalls when the packet descriptor buffer has not yet been drained by the classifier’s “join” function and a new header has arrived. Double buffering is provided both on the
iHdr and descriptor buffers and stalls due to descriptor-buffer-full do not occur during normal
operation.
The stall conditions are summarized in Table 6-5.
Table 6-5. Classifier Stall Conditions
Stall Condition | Instructions Affected | Cycles
Write to hPtr (RAW) | Consumers of hPtr (getpos) | 1
Write to hPtr (RAW) | Consumers of iHdr (get2/peek2) | 4 if new hPtr%8=7, 3 otherwise
Write to tPtr (RAW) | Consumers of the table (mem1/mem2) | 3
Write to SPR (RAW) | MFSPR | 1
Write to hash accumulator (RAW) | Any instruction that reads r27-r30 via SrcA or SrcB will incur a single cycle stall if the previous instruction updated the associated accumulator. | 1
pDesc buffer full (iDMA stall) | All | Stalls until space available
iHdr not valid (waiting for new packet) | Consumers of iHdr (get2/peek2). MFSPR from HEADER_FLAGS, CHANNEL, L2_SIZE SPRs. | Stalls until new iHdr arrives
Conditional branches provide a static prediction hint to optimize the common or critical case for
the branch. The pipeline latencies for the various PC updates are summarized in Table 6-6.
Updates to the header pointer other than the built-in post-increment incur a three- or four-cycle
delay (the delay will be four cycles if the new hPtr%8 = 7). During the update window, operations that consume the iHdr data will stall. Instructions that do NOT consume iHdr will not stall,
so better performance will be achieved if non-iHdr instructions can be inserted after an hPtr
update.
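The scheduling advice above can be sketched as a small cost model in Python. This is only an illustration: the function names are hypothetical, and the model assumes each inserted non-iHdr instruction takes exactly one cycle and hides one cycle of the update window.

```python
def hptr_update_window(new_hptr: int) -> int:
    """Cycles for which iHdr consumers stall after an explicit hPtr write."""
    return 4 if new_hptr % 8 == 7 else 3


def visible_stall(new_hptr: int, independent_ops: int) -> int:
    """Stall cycles still seen by the first iHdr consumer when
    `independent_ops` single-cycle non-iHdr instructions (an assumption of
    this model) are scheduled between the hPtr write and that consumer."""
    return max(0, hptr_update_window(new_hptr) - independent_ops)
```

Under these assumptions, three or four unrelated instructions placed after the hPtr write would hide the update window entirely.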
Ingress Services
Table 6-6. Classifier Instruction Latencies
Instruction Type | Prediction | Actual | Instruction Latency (cycles)
Conditional Branch | Not Taken | Not Taken | 1
Conditional Branch | Not Taken | Taken | 3
Conditional Branch | Taken | Not Taken | 3
Conditional Branch | Taken | Taken | 2
Jump-Register | - | - | 3
Exception | - | - | 2
All Other Operations | - | - | 1
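To help choose a static prediction hint, the conditional-branch latencies in Table 6-6 can be turned into an expected cost per branch. The Python sketch below is illustrative only; the function name and the idea of supplying a measured taken-probability are assumptions, not part of the classifier tool set.

```python
# (hint, actual outcome) -> latency in cycles, taken from Table 6-6.
BRANCH_LATENCY = {
    ("not_taken", "not_taken"): 1,
    ("not_taken", "taken"): 3,
    ("taken", "not_taken"): 3,
    ("taken", "taken"): 2,
}


def expected_branch_cycles(hint: str, p_taken: float) -> float:
    """Average cycles per branch for a given hint and taken-probability."""
    taken = BRANCH_LATENCY[(hint, "taken")]
    not_taken = BRANCH_LATENCY[(hint, "not_taken")]
    return p_taken * taken + (1.0 - p_taken) * not_taken
```

For a branch taken half of the time, this model gives 2.0 cycles with a not-taken hint and 2.5 cycles with a taken hint.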
[Figure 6-9: pipeline timing diagram (stages F, O, E) for hWin updates. An "add hPtr,r1,r2" instruction is followed by unrelated ops that occupy update cycles 0 through 2 and one potential additional update cycle, and then by "add r5,*hPtr++,r6". The diagram marks the re0, re1/rdat0, rdat1, and buf-vld steps of the iHdr buffer read.]
Figure 6-9: iHeader Pointer Update Latency
6.2.7.9 Persistent State
Although the classification processor is intended for stateless packet processing, some state is
retained from packet to packet. Much of the program-visible state is reset, however. Table 6-7
summarizes the architectural state of the classifier.
Table 6-7. Classifier State
Item | Classifier Access | Tile SW Access | Action when PC Transitions to Zero
GPRs | Read/Write | Write Only | State left intact
Hash Accumulators | Read/Write | None | State left intact
Lookup table | Read Only | Write Only | State left intact
SPRs | Varies - see SPR descriptions
pDesc Pointer | Write Only | Write Only (init value) | Set to initial value
iHdr Pointer | Read/Write | Write Only (init value) | Set to initial value
tPtr | Write Only | None | Set to 0
pDesc | Write Only | None | All bytes set to zero
iHdr | Read Only | None | Updated to next header
6.2.7.10 Exceptions
The classifier supports a single exception flow for the case where the program attempts to consume header bytes from beyond the end of the iHdr buffer or beyond the L2 size of the packet.
This can occur, for example, when the classification program encounters an arbitrarily deep
encapsulation of protocols that exceeds the reach of the iHdr window.
The exception handling is provided through a programmable exception target PC. When the classifier generates an exception, the classifier’s PC is directed to the programmed exception handler
location. Handling might simply consist of interrupting the Tile and freezing the classifier, or it
can execute special case instructions and branch back to PC=0 and continue processing the next
packet.
The exception target PC can also be set to 0, which causes the current header and descriptor to be
considered complete upon exception with no further exception processing needed.
The iHdr pointer is considered invalid if it points to the last byte of the iHdr window (0xff) or the
last byte of the packet, since only one of the two bytes read would be valid. If classifier software needs
to access the very last byte, it must set the iHdr pointer to L2_SIZE-2 or 254 (whichever is smaller).
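The pointer rule above can be written as a short check. The following Python sketch is a software model only; the function names are hypothetical, and it assumes a 256-byte iHdr window, as implied by the 0xff last-byte offset.

```python
IHDR_WINDOW_BYTES = 256  # assumed window size; its last byte is offset 0xff


def last_valid_hptr(l2_size: int) -> int:
    """Largest pointer value from which a two-byte read is fully valid."""
    return min(l2_size - 2, IHDR_WINDOW_BYTES - 2)


def hptr_is_valid(hptr: int, l2_size: int) -> bool:
    """True when both bytes of a two-byte read lie inside the window and packet."""
    return 0 <= hptr <= last_valid_hptr(l2_size)
```

For a 64-byte packet the last usable pointer is 62; for a 1500-byte packet it is 254.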
6.2.7.11 Classifier Configuration
At power on, the classifier’s enable bit is cleared, which freezes the classifier at PC=0. The instruction memory, exception PC, initial hPtr, initial pdPtr, and Table contents of the classifier are
loaded by Tile software as part of mPIPE initialization. Additionally, the GPRs can be preloaded
to provide constants for the classification program.
Once the program has been loaded, Tile software sets the enable bit and the classifier will begin
processing incoming packets.
The classifier instruction-memory, lookup-table, and GPRs are configured by writing to the
MPIPE_CLS_INIT_CTL and MPIPE_CLS_INIT_WDAT registers. Any or all of the classifiers can be
initialized simultaneously. The initial pdPtr and hPtr are configured by writing to the associated GPR specifiers via MPIPE_CLS_INIT_CTL/MPIPE_CLS_INIT_WDAT. The exception PC is
configured by initializing with GPR=R24(mempos).
Runtime changes to the classifier’s program can be made by disabling an individual classifier and
reloading program data. If the performance from the remaining classifiers is insufficient to process the incoming traffic flow, packets will be dropped while the classifier is being
reprogrammed.
Tile software cannot directly read the classifier’s SPRs and GPRs. However, the classifier can be disabled
and a program loaded that will expose the architectural state via the classifier’s PASS SPR, which
is visible to mPIPE configuration space.
The classifier must be disabled to update the instruction memory or lookup table. However, the
GPRs can be updated on an active classifier. This provides a means of communication between
Tile software and the classifier.
6.2.7.12 Classifier “Blast” Re/Programming
One or more classifiers can be reprogrammed using the “blast” programmer which provides
deterministic downtime and programming time for performing program updates on the fly.
A single classifier image is stored in the programmer. The image consists of state updates for the
instruction memory, table, and registers. Using the programmer, a full or partial state update to
the classifiers can be performed.
When the FLASH bit of the MPIPE_CLS_ENABLE register is written for one or more classifiers, the
associated classifiers will stop accepting new packets. Once all associated classifiers have finished
their current packet, they will be reprogrammed and re-enabled.
The programmer image is writable via the MPIPE_CLS_INIT_CTL/MPIPE_CLS_INIT_WDAT registers. When triggered by the FLASH bit in the MPIPE_CLS_ENABLE register, the programmer
begins reading at programmer-table entry 0. The programmer table is initially configured via
MMIO stores and contains records consisting of a command followed by 1 or more data words. The
final record in the table is a NULL command. The record format is described in Figure 6-10.
[Figure 6-10: layout of a programmer record. A 32-bit command word (bits 31 through 0) holds the Sel, Reserved, Data Count, and Start Index fields and is followed by the Record Data (Data Count Words).]
Figure 6-10: Classifier “Blast” Programmer Record Format
Table 6-8. Buffer Descriptor Formats
Bits | Description
Data Count | Number of instructions, table entries, or registers to be programmed
Start Index | First index of the structure to be programmed.
Sel | 0 = Instructions, 1 = Table Entries, 2 = GPRs/hPtr/pdPtr/exc_pc, 3 = EndOfRecords
A programmer-table setup to program the entire classifier would look like Figure 6-11.
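Entry | Contents
0 | Command word: Sel = 0, Start Index = 0, Data Count = 1024
1-1024 | Instruction 0, Instruction 1, …, Instruction 1023
1025 | Command word: Sel = 1, Start Index = 0, Data Count = 0 (encodes 2048)
1026-2049 | table[1]/table[0], table[3]/table[2], …, table[2047]/table[2046]
2050 | Command word: Sel = 2, Start Index = 0, Data Count = 25
2051-2063 | GPR[1]/GPR[0], GPR[3]/GPR[2], …, GPR[21]/GPR[20], pdPtr/hPtr, -----/exc_pc
Each data word for table entries and registers holds two values; the odd-indexed value is the one drawn on the left in the figure.
Figure 6-11: Classifier “Blast” Example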
Note that each record does not need to be the maximum size for that structure. For example, if
only 500 instructions are needed, then a smaller iMem record can be used. This will decrease the
programming time.
The initialization of hPtr (getpos) and pdPtr (putpos) sets the default value for those pointers
at the start of each packet. The initialization of tPtr (mempos) sets the exception PC.
The records can be in any order and record types can be repeated. If multiple records write the
same structure and location, the later record entry will overwrite prior entries.
The programmer table contains 2064 entries to allow all classifier states to be programmed if
needed. Note that if all 2064 entries are written, the end-of-records indicator is not needed.
The programming time is approximately 2.6 microseconds if all classifier state is being
programmed.
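The 2064-entry figure can be checked with a little arithmetic. The Python sketch below is an estimate under stated assumptions: the function names are hypothetical, and the packing (one instruction per data word, two table entries or two registers per data word) is inferred from the Figure 6-11 example rather than stated explicitly.

```python
PROGRAMMER_TABLE_ENTRIES = 2064


def words_for(count: int, per_word: int) -> int:
    """Data words needed to hold `count` items (ceiling division)."""
    return -(-count // per_word)


def programmer_entries(instructions: int, table_entries: int, registers: int) -> int:
    """Entries used by one record per structure: a command word plus data words."""
    total = 0
    for count, per_word in ((instructions, 1), (table_entries, 2), (registers, 2)):
        if count:
            total += 1 + words_for(count, per_word)
    return total


def entries_with_terminator(instructions: int, table_entries: int, registers: int) -> int:
    """Add the end-of-records command unless the table is completely full."""
    used = programmer_entries(instructions, table_entries, registers)
    return used if used == PROGRAMMER_TABLE_ENTRIES else used + 1
```

With 1024 instructions, 2048 table entries, and 25 registers, the model fills exactly 2064 entries; a 500-instruction update alone needs 502 entries including the terminator.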
6.2.7.13 SPRs
The classifier provides special purpose registers (SPRs) for access to processor and packet state as
well as communication with Tile software. SPRs are accessed via MFSPR and MTSPR instructions.
The SPRs for the classification processor are defined in Appendix B: Classifier Instructions and SPRs.
The classifier’s PASS SPR can be read from Tile software via the MPIPE_CLS_INIT_CTL/MPIPE_CLS_INIT_WDAT registers.
6.2.7.14 Classifier Tools
Tilera provides tools and a baseline configuration to enable customization of the classification
program. The tool set includes a C compiler, assembler, and simulator. For additional information,
refer to the MDE mPIPE Programmer’s Guide (UG506).
6.2.8 iDMA Engine
The iDMA engine moves data from the iPkt buffer into memory space visible to worker Tiles.
The classification stage provides DMA control information including the buffer stack index, L2
padding, and checksum control. The load balancer indicates which notification ring should be
written when the DMA transfer has completed.
The iDMA engine consumes buffer descriptors from the buffer stack manager as needed and creates the linked-list chains within the buffers for iDMA scatter.
[Figure 6-12: block diagram of the iDMA data path. Labeled elements: From MAC; Channel Active State (Current Size/iPkt-Block); clsQ (WorkQ for Classifier); iPkt (Packet Data); BlkInfo (Per-Packet Info: L2_size, CRC Err); Classifier; LoadBal; CmdQ; iDMA; CSUM; Retry; Buffer Manager (Buffers); Notification Manager (Done); Write Packet Data.]
Figure 6-12: iDMA (see note 1)
When cut-through is required, it is possible for the iDMA engine to run out of data on a given
flow. Rather than stalling and blocking other flows that could make progress, the iDMA engine
recirculates the stalled iDMA command back into the iDMA command queue.
Once a packet has been completely written into Tile-visible memory space, the notification manager is informed that the packet descriptor can now be written into the worker’s notification ring.
6.2.8.1 Temporal Hints for iDMA Writes
The Tile memory system provides a NonTemporal hint mechanism for describing the cache
access properties of data. The NonTemporal hint is useful for reducing cache pollution in cases
where the application might take a long time to access packet data after it arrives from the mPIPE.
For writes, this hint has no effect on the architectural state of the processor. It is only used
to improve performance and/or determinism in the system.
The iDMA engine uses the NonTemporal hint bit (NT_HINT) from the associated buffer stack as a
hint to a cacheline’s home Tile as to whether or not the data is likely to be accessed by the application prior to being naturally displaced from the cache. Additionally, the TEMPORAL_CNT bit of the
MPIPE_IDMA_CTL register can be used to indicate the delineation between temporally local data
in the header versus temporally non-local data in the packet body.
1. Note that clsQ indicates size, channel, and handling information.
6.2.9 Load Balancer
The load balancer assigns incoming packets to workers. The classifier creates a hashed flowID that
it places into the BucketID of the packet descriptor. The load balancer uses the configuration of
the Bucket to determine how to distribute packets.
The load balancer allows all packets from the same bucket to be sent to the same worker Tile. Thus
packet descriptors for a given flow will be processed in order without requiring software to maintain ordering and affinity. Override modes allow other distribution schemes.
Implementation Note: The TILE-Gx36 device implements 4160 hash buckets. This allows a simple 12-bit
hash function to map to the low 4K buckets while reserving 64 buckets for dedicated special purpose
flows and applications.
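The bucket numbering in the implementation note can be sketched as follows. This Python fragment is illustrative: the function names are hypothetical, and it assumes the 64 reserved buckets sit directly above the 4096 hashed buckets, which the note implies but does not state outright.

```python
HASH_BUCKETS = 4096       # reachable with a 12-bit hash
RESERVED_BUCKETS = 64     # dedicated special-purpose flows
TOTAL_BUCKETS = HASH_BUCKETS + RESERVED_BUCKETS


def hashed_bucket(flow_hash: int) -> int:
    """Map a flow hash to one of the low 4K buckets by keeping 12 bits."""
    return flow_hash & (HASH_BUCKETS - 1)


def reserved_bucket(index: int) -> int:
    """Bucket ID of the index-th reserved bucket (assumed to follow the hashed range)."""
    if not 0 <= index < RESERVED_BUCKETS:
        raise ValueError("reserved bucket index out of range")
    return HASH_BUCKETS + index
```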
6.2.9.1 BucketSTS
The bucket status (BucketSTS) table keeps track of the number of packets inflight on a per bucket
basis. For example, if the classifier determines that a packet ‘P’ belongs in bucket ‘B’ and no packets are queued for processing in bucket ‘B’, then the packet descriptor for ‘P’ can be sent to any
eligible worker Tile – so the load balancer chooses the least busy worker ‘W’. However, once ‘P’
has been enqueued for processing at a specific worker, all subsequent packets for bucket ‘B’ must
go to worker ‘W’ in order to maintain flow order.
A counter associated with each bucket is stored in the BucketSTS table. When the counter is
zero, the load balancer can assign any eligible notification ring (worker). The NotifRing index is
then written into the BucketSTS table for the associated bucket. When the counter is nonzero, the
current NotifRing index stored in the BucketSTS table will be used rather than picking a new
NotifRing.
Each time a packet descriptor is enqueued, the associated bucket’s counter is incremented. When
the worker has completed processing a packet, it sends a message to the mPIPE, which decrements the bucket’s counter.
Since a bucket is associated with a single NotifRing at any given time and the bucket counter is
large enough to count the maximum packets that can fit in a ring, it is not possible for the bucket
counter to overflow if software is releasing the bucket each time it releases the NotifRing.
Note that in the “order-agnostic” override flow described in section “Load Balance Override
Flows” on page 117, it is possible for a bucket to have packets outstanding to multiple notification
rings. In this case, the bucket counter can wrap. But since the bucket count is being ignored, this
does not impact operation. And since the counter wraps rather than saturating, it will return to
zero once all packets have been processed.
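The counter and ring bookkeeping described in this section can be modeled in a few lines of Python. This is a behavioral sketch of the default (dynamic flow affinity) case only; the class and method names are hypothetical, the pick_ring callback stands in for the hardware picker, and raising an error on an unmatched release is a modeling choice rather than hardware behavior.

```python
class BucketStatus:
    """Software model of one BucketSTS entry."""

    def __init__(self):
        self.count = 0      # packets in flight for this bucket
        self.ring = None    # NotifRing currently bound to the bucket

    def enqueue(self, pick_ring):
        """Account for a new packet descriptor and return its NotifRing."""
        if self.count == 0:
            # No packets in flight: any eligible ring may be chosen.
            self.ring = pick_ring()
        self.count += 1
        return self.ring

    def release(self):
        """Called when the worker reports that it finished one packet."""
        if self.count == 0:
            raise ValueError("release without a matching enqueue")
        self.count -= 1
```

While the count is nonzero, every packet for the bucket follows the first one to the same ring, which preserves flow order.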
6.2.9.2 Notification Groups
The load balancer provides notification groups to support multiple load balancing domains. This
allows groups of Tiles to be associated with specific buckets. Hence, when the classifier maps a
packet to a specific bucket, it is also indicating a subset of notification rings that are eligible to
receive the packet.
Each bucket in the BucketSTS table contains a NotifRingGroup Index. This index is used to
lookup a table which provides a bit mask of all eligible notification rings for the associated group.
Implementation Note: The TILE-Gx36 device provides 32 notification groups. Each group has a 256-bit
vector indicating all notification rings allowed to receive packets for buckets that map to that group.
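A notification group's eligibility vector can be pictured as a 256-bit mask with one bit per notification ring. The Python sketch below uses an arbitrary-precision integer for the mask; the function names are hypothetical, and the bit-to-ring numbering (bit r for ring r) is an assumption.

```python
NUM_RINGS = 256  # width of the per-group vector on the TILE-Gx36


def make_group_mask(rings) -> int:
    """Build a mask with one bit set per eligible notification ring."""
    mask = 0
    for ring in rings:
        if not 0 <= ring < NUM_RINGS:
            raise ValueError("ring index out of range")
        mask |= 1 << ring
    return mask


def eligible_rings(mask: int) -> list:
    """List the ring indices whose bits are set in the mask."""
    return [ring for ring in range(NUM_RINGS) if (mask >> ring) & 1]
```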
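[Figure 6-13: block diagram of the load balancer. The Classifier supplies a Bucket, which indexes the BucketSTS Table (NotifGroup, Inflight Count, Inflight NotifRing). The Group indexes the NotifGroup Table to give the Eligible NotifRings, from which Pick NotifRing selects a NotifRing. The NotifRing Table holds the WorkerID and the NotifRing address/Head/Tail.]
Figure 6-13: Load Balancer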
6.2.9.3 Notification Ring Arbitration
A tiered round-robin arbitration scheme is used to select a specific notification ring from the eligible notification rings indicated by the notification group associated with a specific bucket. The
arbitration stage chooses the least full notification ring from among the group of eligible rings.
The rings’ fullness is quantized into eight states based on programmable thresholds (MPIPE_LBL_QUANT_THRESHn registers, MPIPE_LBL_QUANT_THRESH0, for example). This simplifies the
arbitration decision while still providing fair load balancing. The load balancer will prefer the
least-full notification rings and will choose round-robin between notification rings at the same
fullness quantization level. The highest state is “full”. Once a notification ring has reached the full
state, no more packets will be sent to that ring.
The threshold is automatically masked based on the ring size such that only the low N bits are
considered when comparing head and tail pointers in a ring of size 2^N. Care must be taken to
ensure that the thresholds are ascending when masked based on all active RingSizes in the system. The reset values of the thresholds provide an example of correctly programmed thresholds
for all possible ring sizes.
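The masking rule can be checked against the default values in Table 6-9. The Python sketch below assumes that the programmed reset thresholds equal the RingSize64K column of that table (THR0 through THR6); the function names are hypothetical.

```python
# THR0..THR6, assumed equal to the RingSize64K column of Table 6-9.
RESET_THRESHOLDS = [1, 2, 4, 9, 2069, 12979, 65534]


def masked_thresholds(ring_size: int, thresholds=RESET_THRESHOLDS) -> list:
    """Keep only the low N bits of each threshold for a ring of size 2^N."""
    if ring_size <= 0 or ring_size & (ring_size - 1):
        raise ValueError("ring size must be a power of two")
    return [t & (ring_size - 1) for t in thresholds]


def is_ascending(values) -> bool:
    """True when every threshold is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))
```

Masking these values for ring sizes of 128, 512, and 2K reproduces the other three columns of Table 6-9, and each masked set remains ascending. A value such as 128 in the THR4 position would mask to 0 for a 128-entry ring and break the ordering.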
Figure 6-14 shows the default quantization thresholds on a log scale.
[Figure 6-14: plot titled "Default Quantization Values (log scale)" showing Number of Descriptors versus Threshold Number (0 through 6) for RingSize128, RingSize512, RingSize2K, and RingSize64K. The plotted values are those listed in Table 6-9.]
Figure 6-14: Default Quantification Thresholds
Table 6-9. Default Load Balancer Quantization
RingSize128 | RingSize512 | RingSize2K | RingSize64K | Quantification Register
126 | 510 | 2046 | 65534 | THR6(full)
51 | 179 | 691 | 12979 | THR5
21 | 21 | 21 | 2069 | THR4
9 | 9 | 9 | 9 | THR3
4 | 4 | 4 | 4 | THR2
2 | 2 | 2 | 2 | THR1
1 | 1 | 1 | 1 | THR0
To enhance the fairness of load balancing, the picker chooses round-robin within each notification
group. In other words, for all NotifRings in a group at the lowest fullness state, the picker will
choose round-robin by remembering which NotifRing was previously chosen for the group.
Each group maintains the state of which NotifRing was last chosen and this state is updated
only when the picker has been used. Thus a bucket that already has a nonzero count will not
update the picker’s round-robin state.
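The arbitration described in this section (quantize fullness, prefer the least-full level, round-robin within that level) can be sketched in Python. This is a simplified model, not the hardware algorithm: the names are hypothetical, a ring is assumed to reach a level when its occupancy is greater than or equal to the threshold, and the round-robin simply advances to the next higher ring index after the last choice.

```python
def fullness_level(occupancy: int, thresholds) -> int:
    """Number of thresholds reached; len(thresholds) means the ring is full."""
    return sum(1 for t in thresholds if occupancy >= t)


class GroupPicker:
    """Per-group picker that remembers the last ring it chose."""

    def __init__(self, rings, thresholds):
        self.rings = sorted(rings)
        self.thresholds = list(thresholds)
        self.last = -1

    def pick(self, occupancy):
        """Return the chosen ring, or None if every eligible ring is full."""
        full = len(self.thresholds)
        levels = {r: fullness_level(occupancy[r], self.thresholds) for r in self.rings}
        best = min(levels.values())
        if best == full:
            return None
        candidates = [r for r in self.rings if levels[r] == best]
        # Round-robin: first candidate after the previous choice, else wrap.
        chosen = next((r for r in candidates if r > self.last), candidates[0])
        self.last = chosen
        return chosen
```

Because the round-robin state changes only inside pick, a bucket that already has a nonzero count and therefore bypasses the picker leaves that state untouched, as the text above describes.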
6.2.9.4 Load Balance Override Flows
The load balance function can be overridden in several ways to support multiple application distribution models running simultaneously in the system. Each bucket can be individually
configured into one of the following modes at initialization via the MPIPE_LBL_INIT_CTL/
MPIPE_LBL_INIT_DAT registers:
• DFA (dynamic flow affinity): The bucket is assigned to the least busy NotifRing in the NotifRingGroup when the counter is zero.
• FIXED: The bucket is statically assigned to a specific NotifRing; hence the NotifRingGroup and picker are not used.
• ALWAYS_PICK: The least busy worker is always selected. The application performs its own locking and ordering on per-flow state as needed. Note that since the bucket's packets can be going to multiple rings, it is possible for the bucket counter to wrap. Although the counter is not used in this case, it will return to zero if software properly releases the bucket for all packets.
• STICKY: Sticky flow affinity. The NotifRing is assigned using the picker ONLY when the current NotifRing is full. Note that in sticky mode, the initial LBL_INIT_DAT_BSTS_TBL.NOTIFRING setting is used until that NotifRing becomes full. Software must ensure that the initial NotifRing is valid.
The classifier selects buckets with mode attributes appropriate for the flow being distributed.
The load balancer can also be overridden on a per-packet basis by the classifier by setting the NR
bit in the packet descriptor. In this case, the NotifRing is selected by the classifier and the load
balancer otherwise acts as if the bucket is in FIXED (static assignment) mode. The bucket’s counter and current
notification ring will not be updated. Packet descriptor based overrides take precedence over the
bucket override modes.
When software processes a packet with the NR bit set in the packet descriptor, it must NOT
release the bucket since that packet does not increment the bucket count.
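The four bucket modes can be compared side by side in a small Python model. This is a sketch under assumptions: the function and dictionary keys are hypothetical, pick and ring_is_full stand in for the hardware picker and ring-full state, the counter is modeled as a 16-bit wrapping value, and the DFA drop at 64K and the per-packet NR override are not modeled.

```python
def select_ring(mode: str, bucket: dict, pick, ring_is_full) -> int:
    """Choose the NotifRing for the next packet of a bucket and count it.

    bucket holds "count" (packets in flight) and "ring" (current NotifRing).
    """
    if mode == "DFA":
        if bucket["count"] == 0:
            bucket["ring"] = pick()
    elif mode == "FIXED":
        pass  # statically assigned ring; the picker is never consulted
    elif mode == "ALWAYS_PICK":
        bucket["ring"] = pick()
    elif mode == "STICKY":
        if ring_is_full(bucket["ring"]):
            bucket["ring"] = pick()
    else:
        raise ValueError("unknown mode: " + mode)
    bucket["count"] = (bucket["count"] + 1) & 0xFFFF  # wraps, does not saturate
    return bucket["ring"]
```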
Table 6-10 summarizes the various types of load balancer flows:
Table 6-10. Load Balancer Flows
Flow Type | Bucket Count | Bucket Current NR | Picker RR-Arb State | NR-Tail
DFA | Incremented. Packet is dropped if count reaches 64K. | Updated by picker if count was zero | Updated if count was zero | Updated (a)
FIXED | Incremented. Counter wraps if it reaches 64K. (b) | Not Updated | Not Updated | Updated (a)
ALWAYS_PICK | Incremented. Counter wraps if it reaches 64K. (b) | Updated (b) | Updated | Updated (a)
STICKY | Incremented. Counter wraps if it reaches 64K. (b) | Updated if current NR is full. | Updated if NR is updated. | Updated (a)
Descriptor (Classifier) Override | Not Incremented | Not Updated | Not Updated | Updated (a)
Drop due to NR full, DFA mode with bucket full, or classifier-drop | Not Incremented | Not Updated | Not Updated | Not Updated
a. If the packet is dropped after load balancing, for example due to running out of buffers, the NR-Tail and count will NOT be updated.
b. Although this state is updated, it is generally not relevant for this type of override flow.
6.2.10 Checksum
The mPIPE provides dedicated checksum offload hardware typically used for TCP processing.
The checksum’s seed and start location are generated by the classifier. The checksum is calculated
by performing a 16-bit 1’s complement addition on all of the bytes from the start byte to the end of
the packet (inclusive). The result of the checksum calculation is delivered as part of the packet
descriptor at notification time.
Packets with bad checksums can be re-distributed to software managed exception queues simply
by writing the packet descriptor into a different queue. The packet data does not have to be
moved.
The classifier also supports 1’s complement addition in its ALU to provide checksum calculations
across L3 headers as part of any IP header validation requirements.
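For software that needs to cross-check the hardware result, the 16-bit 1's complement addition can be sketched in Python. The function names are hypothetical. The sketch forms each 16-bit word in network order (first byte in bits[15:8]), pads an odd-length tail with a zero byte (an assumption, following common Internet checksum practice), and does not model the byte placement of the result in the pDesc or any final inversion.

```python
def ones_complement_sum16(data: bytes, seed: int = 0) -> int:
    """16-bit 1's complement sum of `data`, starting from `seed`."""
    total = seed & 0xFFFF
    if len(data) % 2:
        data += b"\x00"  # assumed padding for an odd number of bytes
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return total


def packet_checksum(packet: bytes, csum_start: int, seed: int = 0) -> int:
    """Sum from the start byte through the end of the packet (inclusive)."""
    return ones_complement_sum16(packet[csum_start:], seed)
```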
6.2.11 Notification
Once a packet’s data and packet descriptor have been written into Tile-visible memory space, the
application is notified that it can begin processing the packet. This notification can be done in one
of two ways: by writing a copy of the NotifRing tail pointer for software to poll, or by dispatching an interrupt.
[Figure 6-15: block diagram of the notification flow. Labeled elements: Classifier; LoadBal; pDesc and pDesc Buffer; iDMA (Write Packet; xfer Done: pDesc-handle, BuffDesc, NotifRing); Coherence Tracker (Write ACKs); SQN & CTRs (Sequence Numbers); Notif Gen; NotifQ; NotifRing State (Curr Tail); Write pDesc To NotifRing; Write NR Tail or Interrupt.]
Figure 6-15: Notification Flow
6.2.11.1 Tail Pointer Updates – Polling Model
For high bandwidth applications that want to poll for new packet arrival on one of their notification rings, the mPIPE supports an automatic tail-pointer update as packet descriptors are written
into the notification ring.
A copy of the ring’s tail pointer is stored in the first 8 bytes of the ring. The next 56 bytes in the
ring are reserved, although they can be used by software if the TUP_PTC bit of the MPIPE_NTF_CTL register is clear.
Each time a new packet descriptor is written to the ring and is visible to Tile software, the notification manager performs a memory space write to update the tail. The master copy of the Tail is
always maintained in the NotifRingTable at the mPIPE and can be accessed through MMIO to
the mPIPE configuration space.
6.2.11.2 Notification Interrupts
Each notification ring has an associated interrupt. This interrupt can be enabled by software to
allow the worker tile to receive interrupts on new packet arrival.
The interrupt supports a self-masking mode so that once an interrupt is delivered, subsequent
ring updates will not trigger an interrupt unless software clears the interrupt status bit. This provides a Linux “NAPI” style of packet delivery where the application can switch between interrupt
and polling driven work queues, depending on the bandwidth requirements.
6.2.11.3 Timestamp and Sequence Number Information
The mPIPE provides timestamp and sequence number information as part of the packet descriptor. Timestamps provide a low-jitter indication of a packet’s arrival time. The timestamp is
formatted similarly to Linux’s NTP, using 32 bits of nanoseconds and 32 bits of seconds.
The timer that delivers the timestamps can be adjusted/synchronized using Tile software, as
needed. The timestamp is applied when the first byte of the packet is sent from the MAC into the
iPkt buffer. The timestamp is automatically corrected by hardware to account for varying
latency through the MAC to the mPIPE. Thus jitter is less than 50ns for 1000Mbps or faster ports –
even between ports running at different speeds.
Two sequence numbers are also provided by the mPIPE. The first is a global 48-bit packet
sequence number applied and incremented for each packet. This allows software to determine a
global order across all packets, regardless of the bucket to which the packet was mapped.
The second sequence number is a configurable 16-bit sequence number. This sequence number is
provided from a table indexed by the GP_SQN field generated by the classifier. This allows the
classifier to provide independent sequence numbers on many different flows, for example on a
channel or VLAN basis.
Implementation Note: The TILE-Gx36 device provides 4160 general purpose sequence numbers thus the
classification program can assign one per hash bucket if desired.
Both types of sequence numbers are generated and incremented as packet descriptors are written
to ensure that any packets dropped by the classifier program or by the iDMA engine due to buffer
starvation do NOT get assigned a sequence number.
6.2.12 Counters
The mPIPE supports 32 general-purpose counters. The packet descriptor generated by the classifier contains two 5-bit encoded values specifying which counters should be incremented when the
packet descriptor is written into a notification ring.
The counters are each 48 bits wide, saturating, read-to-clear, and writable, and each generates an interrupt on saturation. If both counter-selects are the same, the associated counter will only be incremented once.
The interrupt associated with each counter will assert when the counter reaches saturation and on
each increment that occurs when the counter is saturated.
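The counter behavior described above can be captured in a short Python model. The class and method names are hypothetical, and the interrupt is represented simply as a running count of assertions rather than a real interrupt line.

```python
COUNTER_MAX = (1 << 48) - 1  # saturation value of a 48-bit counter


class CounterBank:
    """Software model of the 32 general-purpose counters."""

    def __init__(self, n: int = 32):
        self.values = [0] * n
        self.interrupts = 0  # number of times an interrupt would assert

    def on_descriptor(self, sel_a: int, sel_b: int) -> None:
        """Apply the two counter selects carried by one packet descriptor."""
        for idx in {sel_a, sel_b}:  # identical selects increment only once
            if self.values[idx] >= COUNTER_MAX:
                self.interrupts += 1  # increment attempted while saturated
            else:
                self.values[idx] += 1
                if self.values[idx] == COUNTER_MAX:
                    self.interrupts += 1  # counter just reached saturation

    def read(self, idx: int) -> int:
        """Read-to-clear access."""
        value = self.values[idx]
        self.values[idx] = 0
        return value
```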
Both the counters and the sequence numbers are accessed through the MPIPE_SQN_CTR_CTL and
MPIPE_SQN_CTR_DAT registers. The counters and sequence numbers are initialized to zero at
reset.
6.2.13 Software Override Flows
The mPIPE is designed to satisfy most system architectures without requiring any software/Tile
resources for packet distribution. However, there will be some systems or circumstances within a
system that require software to assume the roles of classification, load balancing, or buffer
management.
6.2.13.1 Software Classification
If the classifier is unable to determine to which bucket a packet should be assigned, the load balancer will be unable to guarantee ordering with other packets from the same flow. Rather than
blocking all flows behind unclassified flows, the unclassified flows can be assigned by the classifier to a bucket and buffer stack reserved for classification processing.
Ingress Channel Flow Control
Software is responsible for processing the exception buckets – the packet descriptor generated by
the classifier will be presented in the notification ring and software can decide how to further
classify and distribute this packet. A dedicated notification group can be used to provide load
balancing of the exception flow across multiple worker Tiles.
The ordering of software-classified packets with respect to hardware-classified packets must be
managed by software as required. The timestamp of the packet can be used to insert software-classified packets into the hardware-classified flow.
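One way software might merge the two flows by timestamp is sketched below in Python. The descriptor is represented as a dictionary with hypothetical "sec" and "nsec" keys for the two 32-bit timestamp fields, and each input list is assumed to be already in arrival order.

```python
import heapq


def ts_key(desc: dict) -> tuple:
    """Order descriptors by seconds, then nanoseconds."""
    return (desc["sec"], desc["nsec"])


def merge_by_timestamp(hw_classified, sw_classified) -> list:
    """Merge two timestamp-ordered descriptor lists into one ordered list."""
    return list(heapq.merge(hw_classified, sw_classified, key=ts_key))
```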
6.2.13.2 Software Load Balancing
In systems where the hardware load balance architecture is insufficient, software load balancing
can be achieved by using a single notification ring and having software copy packet descriptors
from the hardware managed notification ring to software managed rings. Since packet data is
stored in the Tile processor’s L3 cache, the packet data itself does not need to be moved. And
since the packet descriptors fit within a cacheline and the cache system is optimized to move
cache blocks, the expense of copying a cache block is generally not significantly higher than other
smaller-grained communication.
6.2.13.3 Software Buffer Management
For systems that require a buffer management scheme that cannot be realized with the mPIPE’s
buffer stack mechanism, software must have a way to manage buffers on its own.
To support software buffer management, the mPIPE can be directed to move data into a limited
set of buffers on a buffer stack. These buffers could be homed on a specific Tile or set of Tiles.
Software running on dedicated Tiles would then copy data from the hardware managed buffers
into software managed buffers. This copy operation must be supported at line rate hence sufficient Tile resources must be provided to support a line rate L3 to L3 memcopy.
The bandwidth-delay product of the descriptor notification, memcopy, and buffer return operations dictates the size of the buffer pool required for the hardware’s buffer stack. This product
will typically be significantly smaller than a single Tile’s L2 cache.
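The bandwidth-delay sizing can be worked as simple arithmetic. In the Python sketch below the function names are hypothetical, and the example figures quoted afterward (a 10 Gbps line rate, a 10 microsecond notify-copy-return loop, and 2048-byte buffers) are illustrative values, not measurements from the device.

```python
def pool_bytes(line_rate_bps: int, loop_latency_ns: int) -> int:
    """Bytes in flight during one notify/copy/return loop (rounded up)."""
    bits_in_flight = line_rate_bps * loop_latency_ns
    return -(-bits_in_flight // (8 * 1_000_000_000))


def buffers_needed(line_rate_bps: int, loop_latency_ns: int, buffer_bytes: int) -> int:
    """Minimum number of hardware-managed buffers of a given size."""
    return -(-pool_bytes(line_rate_bps, loop_latency_ns) // buffer_bytes)
```

With those example figures the pool works out to 12500 bytes, or seven 2048-byte buffers, before any safety margin is added.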
6.3 Ingress Channel Flow Control
For multi-channel implementations, such as Interlaken or 802.1Qbb ports that support flow control, the mPIPE provides per priority queue iPkt buffer occupancy counters. The counters
increment each time a 128-byte iPkt block is consumed on the associated priority queue. The
counter is decremented when the block is freed (DMA operation completed).
Each MAC has controls to either select the queue from the packet data or override the queue
number. The override can be used on links that are not connected to an 802.1Qbb fabric but still
require priority queueing.
When the counter reaches the high-water mark programmed in the MPIPE_PR_PAUSE_THR registers, back pressure is applied to the MAC using the mechanism defined for the particular
interface (for example, pause frames, priority pause, or other inband flow control).
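The counter behavior can be modeled as below. This is a sketch under stated assumptions: the class and its threshold argument are hypothetical stand-ins for the hardware counter and the MPIPE_PR_PAUSE_THR setting, and the model assumes a packet occupies a whole number of 128-byte iPkt blocks.

```python
BLOCK_BYTES = 128  # iPkt block size stated in the text

class QueueOccupancy:
    """Model of one per-priority-queue iPkt occupancy counter."""

    def __init__(self, pause_threshold):
        # Hypothetical stand-in for the MPIPE_PR_PAUSE_THR high-water mark.
        self.pause_threshold = pause_threshold
        self.blocks = 0

    def consume(self, packet_bytes):
        # Assumption: a packet occupies a whole number of blocks.
        self.blocks += -(-packet_bytes // BLOCK_BYTES)

    def free(self, blocks):
        # Blocks are freed when the DMA operation completes.
        self.blocks -= blocks

    @property
    def back_pressure(self):
        return self.blocks >= self.pause_threshold

q = QueueOccupancy(pause_threshold=4)
q.consume(300)  # occupies 3 blocks
q.consume(64)   # occupies 1 block; the counter reaches the threshold
```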
Implementation Note: The TILE-Gx36 device provides 16 priority queues: one for each 802.1Qbb
priority queue, plus eight additional queues so that each GbE port can be mapped to its own queue
depending on the system configuration.
Tile Processor and I/O Device Guide for the TILE-Gx Family of Processors
Tilera Confidential — Subject to Change Without Notice
121
Chapter 6 mPIPE Architecture
6.4 Packet Drops
With a properly configured system, the mPIPE will not drop packets. However, the conditions
below will cause packets to be dropped or truncated.
6.4.1 Drop/Truncate: iPkt Full
If the iPkt buffer fills up while a packet is arriving from the MAC Interface, the packet will be
truncated. This will be reflected in the packet’s descriptor. If the truncation occurs prior to any
cut-through, the classifier will be informed of the truncation via its HEADER_FLAGS SPR.
If the iPkt buffer is full when a new packet arrives, the packet will be completely dropped and
the associated channel’s drop-count incremented.
The following conditions can lead to the iPkt buffer filling up:
• The classification program is not achieving line rate.
• The buffer stack manager is unable to provide buffers fast enough because the LWM bit of the MPIPE_BSM_CTL register is set too low or the stack manager is encountering unexpectedly high read latency.
• There is insufficient mesh bandwidth for packet or descriptor writes.
• There is excessive latency for packet or descriptor write-acks.
• The clock rate is too low for the classifier or mPIPE.
6.4.2 Drop: Classifier Cycle-Budget
If the classifier’s header queue is filling up and header processing has exceeded the cycle budget,
the classifier will terminate processing and apply the default NotifRing and Dest. The default
Dest can be DROP, so this can lead to the packet being dropped. See “Cycle Budget” on page 104.
6.4.3 Drop: Classifier Program
The classifier can decide to drop packets based on the contents of the incoming packet. The DEST
field in the packet descriptor allows the classifier to drop both the packet data and the descriptor, or
to drop just the packet data. The latter is useful when the packet is already in memory and
the classifier is being used on an eDMA loopback path, or when the packet data has been completely analyzed by the classifier and there is no need for application software to examine the
data.
6.4.4 Drop: NotifRing Full
If workers are not draining their NotifRings fast enough, the ring will fill up. If this happens,
the load balancer will drop both the descriptor and the packet data, and then increment the
INGRESS_DROP_COUNT register.
6.4.5 Drop: Bucket Count Full
If workers are releasing the NotifRing without releasing the bucket in the Load Balancer Bucket
Status Data register (MPIPE_LBL_INIT_DAT_BSTS_TBL), it is possible for the bucket counter to
reach its maximum value (64K) without the NotifRing becoming full. If the bucket is configured
in DFA mode, descriptors arriving at the load balancer for the associated bucket will be dropped.
For buckets configured in FIXED, ALWAYS_PICK, or STICKY mode, the bucket count is allowed to
wrap and packets will not be dropped when the counter reaches its maximum value.
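The difference between DFA mode and the other bucket modes at the counter limit can be summarized in a short model. This sketch assumes the counter is 16 bits wide, so that "64K" means a maximum value of 0xFFFF; the function name and return shape are illustrative.

```python
MAX_COUNT = 0xFFFF  # assumption: the "64K" limit is a 16-bit counter

def on_descriptor(count, mode):
    """Return (new_count, dropped) for a descriptor arriving at a bucket."""
    if count == MAX_COUNT:
        if mode == "DFA":
            return count, True   # descriptor is dropped
        return 0, False          # FIXED, ALWAYS_PICK, STICKY: counter wraps
    return count + 1, False
```

In this model, workers avoid the DFA drop case by releasing the bucket whenever they release the NotifRing entry, which keeps the count well below the limit.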
Egress Services
6.4.6 Drop/Truncate: Out of Buffers
If the stack assigned to a packet has no buffers, the iDMA engine will drop the packet. If the
packet is chained across multiple buffers, it will be truncated at the point the iDMA engine ran
out of buffers.
The BE (buffer error) bit will be asserted in the packet descriptor, and the buffer chain will be terminated by a buffer descriptor with a CHAIN_INVALID indicator. The packet descriptor’s TR
(truncate) bit will not be set for this case. The packet descriptor’s L2_Size field will indicate how
many bytes were written to Tile memory, not the total packet size. For packets that have cut-through, there will NOT be an extra buffer descriptor as is usually present. See “Buffer Release”
on page 98.
6.5 Egress Services
Packets being sent from Tile memory space to the I/O device use the egress flow of the mPIPE.
Similar to the ingress flow, the egress portion of the mPIPE is channelized. Each egress channel
has its own eDMA descriptor ring and is non-blocking between channels.
6.5.1 Typical Egress Flow
[Figure 6-16: Typical Egress Flow. Block diagram showing SW Post Message with Ring and Index, Descriptor Manager (Read Desc, descBuf), eDMA Picker, eDMA (Read Packet), ePkt, Buffer Release (To Tile or Buffer Stack Engine), Egress Channel Picker, and MAC Distribution, with callouts 1 through 5 marking the steps listed below.]
Egress packets typically use the following flow:
1. Software writes an egress descriptor in a ring in memory space and optionally sends an “egress post” MMIO write to the mPIPE.
2. The egress descriptor manager reads the egress descriptors from the ring stored in memory space. The egress descriptor describes the eDMA transaction to be performed.
3. The packet data is read from memory space and written into the ePkt buffer.
4. The data buffer is released back to the buffer stack manager or to software.
5. Once packets are ready for egress, a picker sends channelized packets to the MAC(s).
6.5.2 eDMA Packet Descriptors
Packets destined for egress are defined using one or more eDMA descriptors. These descriptors
are stored in rings in Tile memory space. (Note: MiCA also uses the eDMA descriptor format; refer
to Section 11.2.1.3 Overview and Major Functional Blocks.) The mPIPE provides multiple rings to
support nonblocking egress flows and differentiated classes of service on the same egress channel.
The rings can be set to one of four sizes: 512, 2K, 8K, or 64K descriptors (each descriptor is 16
bytes).
Implementation Note: The TILE-Gx36 device has 24 egress descriptor rings.
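For planning purposes, the memory footprint of each ring size follows directly from the 16-byte descriptor size. The short calculation below is illustrative; the dictionary names are hypothetical.

```python
DESC_BYTES = 16  # each eDMA descriptor is 16 bytes

# The four ring sizes, in descriptors.
RING_SIZES = {"512": 512, "2K": 2048, "8K": 8192, "64K": 65536}

# Bytes of Tile memory occupied by one ring of each size.
footprint = {name: n * DESC_BYTES for name, n in RING_SIZES.items()}

# Worst case for 24 rings, all configured at the largest size.
worst_case_bytes = 24 * footprint["64K"]
```

A 512-entry ring occupies 8 KB and a 64K-entry ring occupies 1 MB, so 24 rings at the largest size would occupy 24 MB.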
Each ring is associated with a particular channel, and multiple rings can be assigned to the same
channel. The eDMA descriptor format is shown in Figure 6-17 and described in detail in Table 6-11 (note that the last 8 bytes comprise a buffer descriptor in the same format that iDMA uses).
[Figure 6-17: eDMA Descriptor Format. Four 32-bit words at byte offsets 0x00, 0x04, 0x08, and 0x0C. Labeled fields: Gen, CSUM, NS, Notif, Bound, Size, CSUM_DEST, CSUM_START, CSUM_SEED, and Reserved, plus a Buffer Descriptor containing VA[41:32], VA[31:7], Offset, StackIDX, HWB, Size, and C.]
Table 6-11. eDMA Packet Descriptors
Bits | Field | Description
29:16 | Size | Number of bytes to be sent for this descriptor. When 0, no data will be moved and the buffer descriptor will be ignored. If the buffer descriptor indicates that it is chained, the low 7 bits of the VA indicate the offset within the first buffer (that is, 127 bytes is the maximum offset into the first buffer). If the size exceeds a single buffer, subsequent buffer descriptors will be fetched prior to processing the next eDMA descriptor in the ring.
15:8 | CSUM_START | Start Byte of Checksum. The checksum start is relative to the first byte of this descriptor. If multiple descriptors for the same packet have CSUM enabled, behavior is unpredictable.
7:0 | CSUM_DEST | Destination of checksum relative to the first byte of this descriptor. The destination bytes fall within the current descriptor space. CSUM_DEST must be less than the value specified in the Size bitfield.
11 | Bound | Boundary Bit. This transfer includes the EOP for this command. Clear on all but the last descriptor for an egress packet.
10 | Notif | Notification interrupt will be delivered when the descriptor has been completely processed.
9 | NS | NoSend. Nothing to be sent (packet was dropped by software). All buffers will be processed and returned as appropriate.
8 | CSUM | Checksum Generation Enabled. Once enabled, subsequent descriptors must not set the CSUM bit. Checksum will be calculated on bytes from CSUM_START to the end of the packet.
0 | Gen | Generation Number. Used to indicate that a descriptor in the ring is valid.
For more information about Buffer Descriptor formats, see Figure 6-3 on page 95.
6.5.2.1 eDMA Descriptor Fetch
The descriptor manager prefetches descriptors to optimize performance and bandwidth. In order
to reduce the Tile memory system bandwidth dedicated to descriptor fetches, the descriptor manager tracks the state of a window of descriptors starting at the current head pointer.
Each descriptor is in one of four states:
• UNKNOWN. The state of the descriptor is not known; hardware must fetch it.
• KNOWN_INVALID. The descriptor is known to be invalid; hardware waits for a post or a hunt.
• KNOWN_VALID. The descriptor is known to be valid; hardware must fetch it.
• DONE. The descriptor has already been fetched and is valid.
A software descriptor post updates the state information for descriptors within the window ahead of the head
pointer, which causes their state to go to KNOWN_VALID. The hardware fetches descriptors starting
at the head pointer that are in the UNKNOWN or KNOWN_INVALID states. A fetched descriptor will
update the state from UNKNOWN to KNOWN_INVALID or KNOWN_VALID.
Implementation Note: The TILE-Gx36 device monitors incoming posts in a window of 64 descriptor
locations in advance of the head pointer in order to prevent excess reads of invalid descriptors.
Posts beyond the window being monitored by the hardware will not have any effect, but the
descriptor engine will still find the descriptor since it will fetch any descriptors in the UNKNOWN state.
6.5.2.2 eDMA Descriptor Hunt Mode
Each ring has an optional hunt mode configured by the MPIPE_EDMA_DM_INIT_DAT register.
This mode allows the descriptor engine to search for valid descriptors even if the state of the
descriptor is KNOWN_INVALID.
Rings that are not in hunt mode will not fetch any descriptors if the state of the head pointer is
KNOWN_INVALID. Thus software must issue a post, as described in “Explicit eDMA Descriptor
Post” on page 126 to cause the descriptor(s) to be processed.
Rings that have been configured with hunt mode enabled do not require any software posting.
Instead, the descriptor manager will periodically check for valid descriptors on the ring(s). The
HUNT_CYCLES bit setting of the MPIPE_EDMA_CTL register controls how often the descriptor
fetcher will read a ring that is in the KNOWN_INVALID state.
Once descriptors are found on a ring, the descriptor manager will continue to fetch descriptors
aggressively until it finds an invalid descriptor. Thus high performance applications with bursty
behavior can operate without any posts at all.
In order to increase the responsiveness of a ring in hunt mode, software can optionally send a
descriptor post with the count field set to 0. This will “wake up” the ring and cause it to look for
descriptors without waiting for the hunt-mode timer to expire. This reduces the latency and jitter
on egress descriptor processing but is not otherwise required for high performance egress operation. Figure 6-18 shows the behavior for a ring operating in hunt mode.
[Figure 6-18: eDMA Descriptor Ring Behavior in Hunt Mode. State diagram with the states Hunt Mode (Sleep), Send Request(s)/Wait for Response, and Fast Mode (Req MaxReq Desc). Transition labels: Post-0; Timer Expired, Win Ring Arb; Win Ring Arb; Response: All Desc Valid; Response: Some Desc Not Valid.]
Note that the descriptor engine will prefetch descriptors beyond the head pointer regardless of
hunt mode. Hence the ordering of packet data writes and descriptor writes must be maintained as
described in section “Descriptor Prefetch and Memory Ordering” on page 127.
6.5.2.3 Explicit eDMA Descriptor Post
Software can optionally inform the eDMA descriptor manager that a descriptor or set of descriptors is valid by issuing an MMIO store to the associated eDMA ring. The store data contains the
current tail pointer and a count of how many descriptors are valid. Although the hardware is
designed to maintain line rate with single-descriptor posts, software can reduce the required
descriptor-read bandwidth by batching two or more descriptors per posting.
Explicitly posting descriptors improves the response time of the descriptor fetch engine and provides more predictable ring interleaving when multiple rings are active on the same egress
channel.
Egress rings that are not in hunt mode are required to use explicit descriptor posts.
6.5.2.4 eDMA Descriptor Ring Reordering
In order to allow worker Tiles to write eDMA descriptors to the ring in any order, the eDMA
descriptor engine supports a valid indicator in each descriptor. Workers can be assigned slots in
the ring and can post their eDMA descriptor at any time without regard for other workers that are
sharing the ring.
The descriptor engine will process the descriptors as they become valid from oldest to newest
(head to tail). Thus the tail pointers in MMIO posts sent to the descriptor manager are NOT
required to be in ascending order.
To prevent the need for clearing the valid bit each time a descriptor is processed, the valid indicator is implemented using a generation number. The generation number is incremented each time
the ring wraps, hence the hardware can always tell when a descriptor is valid by comparing the
current generation number to the generation number stored in the descriptor.
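The generation-number scheme can be illustrated with a small model. It assumes the generation number is the single Gen bit listed in Table 6-11, and it assumes an initial generation value of 1 for the first pass through the ring; the initial value and the function names are hypothetical, chosen only to make the example concrete.

```python
def expected_gen(descriptor_number, ring_size, first_gen=1):
    """Generation bit for the descriptor with the given running number.

    descriptor_number counts descriptors since ring initialization
    (0-based), so descriptor_number // ring_size is the number of wraps.
    first_gen is an assumed initial value, not taken from the guide.
    """
    wraps = descriptor_number // ring_size
    return (first_gen + wraps) & 1

def is_valid(stored_gen, hw_descriptor_number, ring_size, first_gen=1):
    """Hardware-side check: the slot is valid only if the generation stored
    in the descriptor matches the generation of the current pass."""
    return stored_gen == expected_gen(hw_descriptor_number, ring_size, first_gen)
```

Because a slot written on the previous pass still holds the previous generation, it reads as invalid on the current pass without software ever clearing it.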
Each time a descriptor is written into the ring, software optionally sends a message to the descriptor manager with the location that was written (these can also be batched by software, as
described below). If the location posted matches the current head pointer, the descriptor manager
will begin fetching descriptors for the eDMA engine. It will continue fetching until it encounters
an invalid descriptor at which point it waits for a post to the oldest location in the ring.
The eDMA hardware also supports batch posting of descriptors. The message sent to the descriptor manager contains a count of the number of descriptors that have become valid. If a batch of
descriptors crosses the ring boundary, the descriptor manager will automatically wrap to the
beginning of the ring. Software is encouraged to batch descriptors if possible as a way to reduce
the overall Tile-memory bandwidth demand.
6.5.2.5 Descriptor Prefetch and Memory Ordering
In order to maintain line-rate performance, the descriptor manager prefetches descriptors from
the ring based on various heuristics. For example, the descriptor manager can prefetch descriptors to the next cacheline boundary or even additional cachelines under certain circumstances.
Due to this hardware prefetching, software must never set the valid bit (generation number) on a
descriptor unless the rest of that descriptor is complete and valid. This means that software must
write all other bytes of the descriptor prior to writing the generation number field of the descriptor. The TILE-Gx processor memory system guarantees that writes to the same cacheline, and
hence to the same eDMA descriptor, will be observed by the descriptor manager in order.
No fence is needed between the writes to the descriptor data and the descriptor’s generation
number as long as the second 8 bytes of the descriptor are written in program order prior to the
first 8 bytes.
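The required store order can be shown schematically. Python cannot express hardware memory ordering, so the sketch below only illustrates the program order of the two 8-byte stores; the ring is a plain bytearray, the little-endian packing is an assumption, and `write_descriptor` is a hypothetical helper.

```python
import struct

DESC_BYTES = 16

def write_descriptor(ring, slot, first_half, second_half):
    """Write one 16-byte descriptor into a ring held in a bytearray.

    The second 8 bytes (the buffer descriptor) are stored first. The first
    8 bytes, which carry the generation number, are stored last so the
    descriptor never appears valid while it is only partially written.
    """
    base = slot * DESC_BYTES
    struct.pack_into("<Q", ring, base + 8, second_half)  # store 1
    struct.pack_into("<Q", ring, base, first_half)       # store 2 (has Gen)

ring = bytearray(2 * DESC_BYTES)
write_descriptor(ring, 1, first_half=0x1, second_half=0xABCD)
```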
6.5.2.6 Descriptor-Write and Descriptor-Post Ordering
When software posts descriptors, it will typically write the descriptor data, issue a memory fence,
then send the post MMIO write to the eDMA engine.
The eDMA descriptor manager supports an optional hunt mode that allows software to operate
without the memory fence described above.
In this mode, software can write the descriptor data then issue the post without an intervening
memory fence. This might cause the descriptor manager to fetch descriptors that are not yet valid.
In hunt mode, the descriptor engine will continue to refetch the descriptor until it is valid. This
mode trades software performance for potential extra read/response bandwidth in the case
where the descriptor data has not yet become visible. The other potential risk with hunt mode is
that an inadvertent software post of an invalid descriptor will cause the descriptor engine to continue to refetch the invalid descriptor “forever”. However, this will be rate limited by the
HUNT_CYCLES setting.
6.5.2.7 Ring to Channel Mapping
Each descriptor ring is associated with a single egress channel. Multiple rings can be assigned to
the same channel in order to provide independent rings for different software applications or
classes of service. Independent rings do not cause head-of-line blocking for each other, so a
low-priority flow and a high-priority flow on the same egress channel but on different rings will
not interfere. When packets are available from multiple rings targeting the same egress channel,
round-robin arbitration is used to determine egress packet order. The mapping of ring to channel
is configured via the MPIPE_EDMA_DM_INIT_CTL/MPIPE_EDMA_DM_INIT_DAT registers.
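Round-robin arbitration among rings that target the same channel can be sketched as follows. This is a generic round-robin model with hypothetical names, not a description of the hardware arbiter's internals.

```python
def round_robin(pending, last_served):
    """Pick the next ring to serve.

    pending[i] is True when ring i has a packet ready; last_served is the
    index of the ring served most recently. The search starts just after
    last_served so that no ready ring is skipped twice in a row.
    """
    n = len(pending)
    for step in range(1, n + 1):
        ring = (last_served + step) % n
        if pending[ring]:
            return ring
    return None  # no ring has a packet ready
```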
6.5.2.8 Descriptor Errors
The eDMA engine detects descriptor errors and faults. When an error is detected, the associated
ring is frozen and software must flush the DMA ring, as described in section “EDMA Ring Drain”
on page 140.
The eDMA descriptor errors that are detected are described in Table 6-12.
Table 6-12. eDMA Fault Handling
Error Type | Description | Handling
Illegal Stack | Descriptor attempted to return buffers to a stack whose STACK_ENA bit was clear in the MPIPE_EDMA_RG_INIT_DAT_STACK_PROT register. | Descriptor is discarded, ring is frozen, the EDMA_DESC_DISCARD interrupt is triggered.
Illegal ASID | ASID_ENA bit was clear in the MPIPE_EDMA_RG_INIT_DAT_STACK_PROT register for the stack that was specified in the buffer descriptor. | Descriptor is discarded, ring is frozen, the EDMA_DESC_DISCARD interrupt is triggered.
Size Error | Size of transfer exceeded the size of hardware-returned unchained buffer. If buffer is NOT returned to hardware or buffer is chained, this error will not be flagged. | Descriptor is discarded, ring is frozen, the EDMA_DESC_DISCARD interrupt is triggered.
TLB Fault | No translation found for ASID/VA. | If the associated ASID’s MPIPE_EDMA_ASID_FAULT_MODE register is set, handling is identical to the errors above (discard/freeze/interrupt). If the FAULT_MODE bit is clear, the descriptor will be retried until a valid translation is installed.
6.5.3 Buffers
The egress portion of the mPIPE uses the same buffer descriptor format as the one used in the
ingress mPIPE (refer to Figure 6-3 on page 95). This allows buffer descriptors to be automatically
recycled back to the buffer stack engine on egress.
6.5.3.1 Chaining
To support line rate gather compatible with the iDMA chaining architecture, the eDMA engine
supports linked-list-chained buffers where each buffer contains a buffer descriptor for the next
buffer in the chain. The buffer descriptor is stored in the first 8 bytes of each buffer. The eDMA
hardware determines when it has finished fetching buffers, based on the Size field stored in the
eDMA packet descriptor and the size associated with each buffer descriptor it fetches.
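The linked-list gather can be modeled abstractly. In this sketch each buffer is a Python dict whose `next` entry stands in for the buffer descriptor stored in the buffer's first 8 bytes and whose `data` entry is the payload that follows it; the real descriptor encoding is not modeled, and all names are hypothetical.

```python
def gather(buffers, first_id, size):
    """Collect `size` payload bytes by walking a chain of buffers.

    The walk stops once the Size from the eDMA descriptor has been
    satisfied, so the chain pointer in the final buffer is never followed.
    """
    out = bytearray()
    buf_id = first_id
    while len(out) < size:
        if buf_id is None:
            raise ValueError("chain ended before Size bytes were gathered")
        buf = buffers[buf_id]
        out += buf["data"][: size - len(out)]
        buf_id = buf["next"]
    return bytes(out)

buffers = {
    0: {"next": 1, "data": b"abcd"},
    1: {"next": None, "data": b"efgh"},
}
payload = gather(buffers, first_id=0, size=6)
```

Each hop in this loop depends on the previous buffer having been read, which is why the text below describes descriptor-based gather as an alternative for small buffers.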
6.5.3.2 Descriptor-Based Gather
When the system requires high performance from a single ring and the buffer size is relatively
small (less than 512 bytes), the hardware linked-list chase might not provide sufficient bandwidth
since too few Tile memory reads can be launched to keep the egress pipe busy.
In this case, software can instead provide the eDMA engine with a set of descriptors that form a gather list.
This allows the hardware to prefetch data in order to keep up with the system line
rate. If the data in Tile memory is stored in linked-list chain format, the eDMA descriptors must
point to the data portion of each linked-list buffer (that is, skip over the first 8 bytes in the
buffer). All but the last eDMA descriptor must have its Boundary bit cleared, indicating that subsequent descriptors make up the egress packet.
If software is using large buffers (512 bytes or larger) or many simultaneous channels, C=1 mode is sufficient (buffer pointers stored in the first 8 bytes of each buffer). The eDMA engine will also read ahead
multiple descriptors so small packet performance is not impacted by the buffer chaining “gather”
issue.
6.5.3.3 Transaction Sizing and Buffer Offsets
The interaction between the eDMA descriptor size field, buffer descriptor chain field, buffer
descriptor size field, descriptor offset, and hardware release field warrants detailed explanation
and rules.
• The eDMA size field dictates the number of bytes moved by the descriptor. This value does NOT include the 8-byte chain pointers that can be included in the first 8 bytes of each buffer.
• If the buffer descriptor in the eDMA descriptor indicates that the buffer is chained, ALL buffers associated with this eDMA descriptor contain a chain pointer in the first 8 bytes. The one in the final buffer can be marked as INVALID. Hardware will return the empty buffer associated with the final buffer pointer if the DISABLE_FINAL_BUF_RTN bit of the MPIPE_EDMA_DIAG_CTL register is zero and the buffer is not marked INVALID. This occurs when an ingress packet was cut through and chained.
• The VA provided by the buffer descriptor inside the eDMA descriptor provides the starting byte location of the packet data. When buffers are chained, the offset is inclusive of the 8-byte chain pointer field contained in the first 8 bytes of all chained buffers. When releasing buffers to hardware, the buffer stack manager does not store the low 7 VA bits.
• Since buffers are only required to be aligned to 128-byte boundaries, it is not possible for hardware to process chained buffers with an offset larger than 127. In other words, the location of the chain pointer is derived by clearing the low 7 bits of the buffer VA. If software wishes to send packet data starting more than 127 bytes into a buffer, it must release the buffer back to software and NOT use buffer chaining.
• For buffers with the HWB bit set (hardware release), a size error is detected if the buffer is unchained and the transfer exceeds the boundary of the buffer based on the encoded buffer size. When a size error is encountered, the descriptor will be discarded, an EDMA_DESC_DISCARD interrupt will be triggered, and the ring will be frozen.
• The above size error check is NOT performed on buffers with the HWB bit clear, because the buffer’s actual start address is not known (the offset could be larger than 127 bytes).
• Each buffer stack is associated with a single buffer size. If an eDMA buffer descriptor is returned to a hardware stack with a different size configuration, the buffer will be treated as having the size associated with the stack upon reuse for iDMA.
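The rule that the chain pointer location is derived by clearing the low 7 bits of the buffer VA can be expressed directly. The function name is hypothetical; the 7-bit mask follows from the 128-byte alignment stated above.

```python
OFFSET_MASK = 0x7F  # low 7 bits: byte offset within a 128-byte-aligned buffer

def split_va(va):
    """Return (chain_pointer_address, data_offset) for a chained buffer VA."""
    return va & ~OFFSET_MASK, va & OFFSET_MASK

base, offset = split_va(0x123C5)
```

Here the VA 0x123C5 yields a chain pointer at 0x12380 and a data offset of 0x45. An offset of 128 or more cannot be represented, which is why such buffers cannot use chaining.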
6.5.3.4 Buffer Release
Each buffer descriptor used for eDMA contains an HWB field, which indicates whether or not the buffer should be returned to the stack engine.
When releasing buffers to hardware, the SIZE field in the buffer descriptor must match the size
configured for the associated buffer stack. If these do not match, buffers might be lost from the
associated stack or data within the associated ASID region might be corrupted.
If software wants to manage its own buffers, the HWB bit must be clear and software needs to determine that the eDMA transaction is complete using either the eDMA-ring interrupt or by reading
the descriptor-complete count in the eDMA-ring MMIO structure.
6.5.3.5 Egress VA Translations
Egress packet descriptors contain a buffer descriptor with a virtual address. This VA is translated
into a physical address by the eDMA descriptor processing hardware using the ASID/region process described in Section 6.6 Virtual Memory. The StackIDX provided in each buffer
descriptor is used to determine which ASID (set of TLB entries) is used for the translation.
If the HWB bit is set in the buffer descriptor, the buffer will be returned to hardware and an additional protection check is provided via the Stack Protection bits in the
MPIPE_EDMA_RG_INIT_CTL/MPIPE_EDMA_RG_INIT_DAT registers.
6.5.4 eDMA Engine
As valid descriptors are fetched by the descriptor manager, they are presented to the eDMA
engine. Since multiple channels might have valid descriptors pending, a picker determines which
descriptors to process based on ePKT (buffer) availability and round-robin fairness arbitration.
Once a descriptor has been chosen, the eDMA engine reads the data from Tile memory space and
writes the data into the ePKT buffer queue associated with the descriptor’s channel. Buffers are
then freed to the buffer stack manager or messaged to software.
In order to maintain line rate performance on any single flow, the eDMA engine performs descriptor processing (packet reads), response processing (write into ePkt buffer), and notification in
parallel. This allows many simultaneous packet gather threads to be running in parallel to hide
memory read latency, Tile resource contention, and temporary network congestion.
6.5.5 ePkt Buffering
The ePkt buffer provides elasticity between the Tile memory system and the egress MAC Interface.
This elasticity prevents variations in eDMA read latency from affecting the output bandwidth.
The ePkt buffer size is measured in 128-byte blocks. Packets always consume an integer number of
blocks and a block cannot be shared between two different packets. The buffer size is provided in
the MPIPE_EDMA_STS register.
The ePkt buffer blocks are divided into an undifferentiated pool and a reserved pool. Undifferentiated buffers can be consumed by any ring that has its DB bit set in the THRESH structure of the
MPIPE_EDMA_RG_INIT_CTL/MPIPE_EDMA_RG_INIT_DAT register set.
The reserved pool is divided between the eDMA rings based on the MAX_BLKS thresholds, which
are programmable on a per-ring basis via the MPIPE_EDMA_RG_INIT_CTL/MPIPE_EDMA_RG_INIT_DAT registers.
Head of line blocking between rings will not occur as long as the sum of all of the MAX_BLKS
thresholds does not exceed the size of the reserved buffer pool. The size of the undifferentiated
and reserved pools is configured via the UD_BLOCKS setting in the MPIPE_EDMA_CTL register.
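The head-of-line blocking condition can be checked with a one-line sum. The sketch assumes that the reserved pool is whatever remains of the ePkt buffer after the UD_BLOCKS undifferentiated blocks are set aside; that relationship, the function name, and the block counts used below are assumptions for illustration.

```python
def reserved_pool_ok(total_blocks, ud_blocks, max_blks):
    """True when the per-ring MAX_BLKS thresholds fit in the reserved pool.

    Assumption: reserved pool = total ePkt blocks - undifferentiated blocks.
    max_blks is a list with one MAX_BLKS value per ring.
    """
    reserved = total_blocks - ud_blocks
    return sum(max_blks) <= reserved

# Hypothetical configuration: 1560 blocks total, 600 undifferentiated.
fits = reserved_pool_ok(1560, 600, [40] * 24)       # 960 <= 960
oversubscribed = reserved_pool_ok(1560, 600, [41] * 24)  # 984 > 960
```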
6.5.6 Notifications
There are several types of eDMA-complete notifications to provide software with flow control and
buffer-complete information.
6.5.6.1 Descriptor Ring Head
As descriptors are consumed by the eDMA descriptor fetch engine, the hardware updates the
head pointer. The head pointer is accessed by an MMIO read to the “eDMA Ring” STRUCT.
This mechanism allows software to know when the ring is full. There is no interrupt associated
with a descriptor after it has been read. It is assumed that software will either periodically read the
head pointer or use the descriptor-complete interrupt/counter described below.
6.5.6.2 Descriptor Complete Interrupt and Counter
The descriptor ring head pointer provides an indication of what descriptors have been fetched
from the ring, but it does not indicate that the transaction associated with the descriptor has
completed.
When software needs to know that the transactions associated with an eDMA descriptor have
completed, the DescComplete interrupt and DescriptorCompleteCount are used. These
indicate that all memory associated with the descriptor has been accessed and the buffer can be
reused. Each ring contains a 16-bit rolling count of descriptors that have been processed. This
counter can be masked based on the ring size to determine which descriptors have been completely processed.
The DescriptorCompleteCount is directly accessible via an MMIO read to the “eDMA Ring”
structure.
The DescriptorComplete mechanism can be used when buffers are being managed by software rather than being returned directly to the buffer stack manager.
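Software that polls the 16-bit rolling count has to handle wraparound. The sketch below shows one way to do that, assuming fewer than 65536 descriptors complete between two reads and that ring sizes are powers of two (as the four supported sizes are); the function names are hypothetical.

```python
def newly_completed(prev_count, curr_count):
    """Descriptors completed between two reads of the 16-bit rolling count."""
    return (curr_count - prev_count) & 0xFFFF

def ring_slot(count, ring_size):
    """Mask a rolling count down to a ring index (ring_size is a power of 2)."""
    return count & (ring_size - 1)
```

For example, a count that moves from 0xFFFE to 0x0003 represents five completions, and a count of 0x0203 corresponds to slot 3 in a 512-entry ring.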
6.5.7 Checksum
The egress mPIPE supports checksum offload typically used for the TCP body. The eDMA packet
descriptor contains a checksum seed, start octet, total octet count, and destination start octet. As
the eDMA engine writes data into the ePKT buffer, the checksum is tracked and updated.
6.5.7.1 eDMA Checksum Buffer Limitations
Since the checksum result is typically stored in the header, the entire packet must be read before
the packet can begin egress to the MAC. Thus buffering must be provided for packets undergoing
checksum. Multi-channel implementations might provide insufficient buffering for checksum of
jumbo frames.
Attempting to checksum a packet larger than the cut through size will result in a corrupted
checksum.
Implementation Note: The TILE-Gx36 device provides a 195KB ePKT buffer dynamically partitioned
between all active egress rings. A four-ring implementation can provide checksum offload for all
frame sizes. However, since line-rate performance requires double-buffering, a 24-ring configuration
can only provide hardware checksum on frames of up to approximately 4KB (4KB*2*24 = 192KB, which fits within the 195KB buffer).
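The arithmetic in the note above generalizes to other ring counts. The sketch assumes an even split of the buffer between rings, double-buffering as stated in the note, and 1KB = 1024 bytes; the function name is hypothetical.

```python
def max_checksum_frame_bytes(epkt_bytes, rings, copies=2):
    """Largest frame that can be fully buffered per ring for checksum,
    assuming the ePkt buffer is split evenly and double-buffered."""
    return epkt_bytes // (rings * copies)

EPKT_BYTES = 195 * 1024  # 195KB buffer from the implementation note

per_ring_24 = max_checksum_frame_bytes(EPKT_BYTES, 24)
per_ring_4 = max_checksum_frame_bytes(EPKT_BYTES, 4)
```

Under these assumptions, 24 rings leave 4160 bytes per frame, and four rings leave 24960 bytes per frame.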
Descriptors with checksum enabled must follow these rules:
• The CSUM_START field specifies the first byte to include in the checksum. All bytes from CSUM_START to the end of the packet will be included.
• The CSUM_DEST field specifies the target of the first (more significant) byte of the checksum result.
• At most one eDMA descriptor per packet can have its CSUM_ENA bit set.
• For descriptors with CSUM_ENA=0, CSUM_START, CSUM_DEST, and CSUM_SEED must be zero.
• CSUM_START or CSUM_DEST can specify a byte beyond the current eDMA descriptor. The checksum engine will wait until CSUM_START bytes have been collected across as many descriptors as necessary before beginning the checksum.
• If the total packet size is larger than the cut-through threshold, only the bytes up to the threshold will be included in the checksum operation.
• The CSUM_SEED is formatted in network byte order. In other words, bits[7:0] of CSUM_SEED will be added to the byte specified by CSUM_START. The resulting checksum will be inverted and placed into CSUM_DEST with bits[7:0] going into the byte pointed to by CSUM_DEST and bits[15:8] going into the byte pointed to by CSUM_DEST+1.
• The CSUM_DEST and CSUM_START fields do not include bytes that are part of a NoSend=1 descriptor. In other words, if CSUM_DEST is set to 100 with NoSend=1 (and boundary=0), the CSUM_DEST will be 100 bytes into the next descriptor that has NoSend=0.
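The seed, start, and destination rules can be illustrated with a software model of a 16-bit ones' complement checksum. This is a sketch of the standard Internet checksum arranged to match the byte placement described above, not a statement of how the hardware is built. It treats the whole packet as a single descriptor and ignores the cut-through threshold and NoSend rules; the function names are hypothetical.

```python
def ones_complement_sum(data, seed=0):
    """16-bit ones' complement sum, with the first byte of each pair in bits[7:0]."""
    total = seed
    for i in range(0, len(data), 2):
        lo = data[i]
        hi = data[i + 1] if i + 1 < len(data) else 0  # pad odd length with 0
        total += lo | (hi << 8)
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total

def apply_checksum(packet, csum_start, csum_dest, seed=0):
    """Checksum bytes from csum_start to the end of the packet, invert the
    result, and store bits[7:0] at csum_dest and bits[15:8] at csum_dest+1."""
    csum = ~ones_complement_sum(packet[csum_start:], seed) & 0xFFFF
    packet[csum_dest] = csum & 0xFF
    packet[csum_dest + 1] = csum >> 8

# A 20-byte IPv4-style header with a zeroed checksum field at offset 10.
pkt = bytearray(bytes.fromhex("4500001c0000000040110000c0a80001c0a80002"[:0]) or
                bytes.fromhex("4500001c000000004011" "0000" "c0a80001" "c0a80002"))
apply_checksum(pkt, csum_start=0, csum_dest=10)
```

After the checksum is stored, summing the same range again yields 0xFFFF, which is the usual receiver-side validity check.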
6.5.8 Egress Picker
Each egress channel can generate back pressure at any time – even within a packet. To maintain
line rate service on other channels, an egress picker matches available channels from the ePkt buffer to non-blocked egress channels. This provides low latency response to flow control events (for
example pause frames, priority pause frames, and channel flow control) while keeping the link(s)
saturated.
6.5.8.1
Egress Priority Arbitration
Each eDMA ring has a programmable priority level in the PRIORITY_LVL bit of the MPIPE_EDMA_RG_INIT_DAT_MAP register. This field selects one of three priority levels for the ring. The
egress arbiter chooses round-robin within each priority level and maintains strict priority
between the levels. Thus rings set to level-2 will always have priority over rings set to levels 0 and
1.
The mPIPE egress picker also provides bandwidth controls for each priority level. This can be
used for basic rate shaping and prevention of starvation between the strict priority levels. The
bandwidth controls are located in the MPIPE_EDMA_BW_CTL register.
A token bucket scheme is used wherein each of the three egress priority levels is provided with a
token bucket. A token for the ring’s priority level must be available in order for a packet to begin
sending. Each time a 128-byte block is sent, a token is consumed. The token buckets for each priority level are replenished based on the settings in this register. These register settings control the
rate at which tokens are replenished for each priority level. Each unit represents approximately
6*LINE_RATE/(N+1) where LINE_RATE includes packet overhead.
When set to 0, tokens are replenished as fast as they can be consumed, hence setting all PRIORn_RATE values to zero will revert to a strict-priority scheme. Note that the packet arbiter can
run faster than line rate since there is buffering in the egress path. For this reason, settings that
allow the bandwidth to exceed LINE_RATE are meaningful. The setting for PRIOR2 is typically
higher than PRIOR1 and PRIOR0 in order to prevent starvation. Similarly, PRIOR1 is typically set
higher than PRIOR0.
The LINE_RATE divider setting in the MPIPE_EDMA_BW_CTL register can be used to set the coarse
granularity for the expected egress bandwidth and the PRIORn_RATE settings fine tune the bandwidth. Additionally, the BURST_LENGTH bit determines the hysteresis of the token buckets.
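The interaction between strict priority and the per-level token buckets can be sketched as a small software model. This is an illustration only: the struct, the function names, and the burst cap (standing in for the BURST_LENGTH hysteresis) are assumptions, the replenish amount is passed in directly rather than derived from LINE_RATE and PRIORn_RATE, and round-robin selection among rings within one level is not modeled.

```c
#include <stdint.h>

#define NUM_LEVELS  3     /* three egress priority levels */
#define BLOCK_BYTES 128   /* one token is consumed per 128-byte block */

struct egress_buckets {
    int32_t tokens[NUM_LEVELS];  /* may go negative while a packet is sent */
    int32_t burst[NUM_LEVELS];   /* hypothetical cap on accumulated tokens */
};

/*
 * Pick the level allowed to start a packet: the highest level that has a
 * ring ready to send and at least one token. Returns -1 if none qualifies.
 */
int pick_level(const struct egress_buckets *b, const int ready[NUM_LEVELS])
{
    for (int lvl = NUM_LEVELS - 1; lvl >= 0; lvl--)
        if (ready[lvl] && b->tokens[lvl] > 0)
            return lvl;
    return -1;
}

/* Consume one token for every 128-byte block (or part of one) sent. */
void send_blocks(struct egress_buckets *b, int lvl, uint32_t bytes)
{
    uint32_t blocks = (bytes + BLOCK_BYTES - 1) / BLOCK_BYTES;

    b->tokens[lvl] -= (int32_t)blocks;
}

/* Add tokens to one level, saturating at the burst cap. */
void replenish(struct egress_buckets *b, int lvl, int32_t n)
{
    b->tokens[lvl] += n;
    if (b->tokens[lvl] > b->burst[lvl])
        b->tokens[lvl] = b->burst[lvl];
}
```

In this model a level-2 ring that has exhausted its tokens no longer blocks level 1 or level 0, which is how the bandwidth controls prevent starvation under strict priority.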
6.5.8.2
Egress Priority Flow Control
Each eDMA ring has a programmable mask in the PRIORITY_QUEUES bit of the MPIPE_EDMA_RG_INIT_DAT_MAP register indicating which priority queue(s) will be sent using the ring.
When a MAC applies back pressure to a particular priority queue using the priority-pause mechanism, any rings with the associated mask bit set will be back-pressured. Software must ensure that
packets targeting a given priority queue are sent using a ring that has the associated PRIORITY_QUEUES bit set.
6.5.9
Special Flows
The eDMA engine provides special flows to support packet drops and loopback.
6.5.9.1
NoSend Option
If the system requires packets to be dropped but still wants to return buffers to the buffer stack
manager, the NoSend mode can be used. When the NS bit is set in the eDMA descriptor, the
eDMA command will be processed as normal, except no data will be transferred to the MAC. All
buffers will be returned as specified in the eDMA and buffer descriptors, and the notification and
HWB (buffer return) fields in each buffer descriptor are still honored.
All checksum fields in NoSend descriptors must be zero.
The boundary bit is ignored and treated as if zero on NoSend descriptors. Thus the last descriptor
for a packet must move at least one byte – a NoSend descriptor cannot be used to terminate a
“real” packet.
If application software has reserved a slot in an eDMA ring but does not want to send any packet
and does not wish to return any buffers, it can post a descriptor with NoSend=1 and Size=0.
Note that descriptors with NoSend=1 and a nonzero Size field must always follow a descriptor
with Boundary=1 or another NoSend=1 descriptor. In other words, a NoSend=1 with Size!=0
cannot be used in the middle of a “normal” set of descriptors.
6.5.9.2
Size=0 Option
Descriptors with size=0 are No-Ops. These are typically used to fill a slot in the ring that is not
going to be used and needs to be skipped over. Size=0 descriptors should have all CSUM-related
fields set to 0.
6.5.9.3
eDMA Loopback
The ingress flow has channels dedicated to receiving data looped back from eDMA. This allows
the high performance eDMA gather portion of the mPIPE to feed packet data into the iDMA classification, load-balance, and distribution flows. This option is used, for example, to treat data
moved from a PCI interface as packet data and distribute to worker Tiles as if it had arrived on a
packet interface.
Bandwidth that is used for eDMA loopback is unavailable for normal egress traffic. So, for example, if a 40Gbps implementation is using 20Gbps of loopback bandwidth, then the egress channels
will only support 20Gbps. Typically this restriction is manifested as a system requirement based
on the amount of ingress PDA offload supported – in this example, 40Gbps.
The PRIORITY_QUEUES bit of the MPIPE_EDMA_RG_INIT_DAT_MAP register determines which
priority queue is consumed by packets on an eDMA loopback channel. The highest set bit in the PRIORITY_QUEUES mask is decoded as the priority queue.
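As a minimal sketch of the decode rule for loopback channels, the function below returns the index of the highest set bit in a PRIORITY_QUEUES mask. The function name and the -1 return for an empty mask are assumptions made for illustration; the text does not say how hardware treats a mask with no bits set.

```c
#include <stdint.h>

/* Index of the highest set bit in the mask, or -1 if no bit is set. */
int loopback_priority_queue(uint32_t priority_queues_mask)
{
    int queue = -1;

    while (priority_queues_mask != 0) {
        queue++;
        priority_queues_mask >>= 1;
    }
    return queue;
}
```

For example, a mask of 0x14 (bits 2 and 4 set) decodes to priority queue 4.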
Implementation Note: The TILE-Gx36 device provides four dedicated loopback channels.
6.6 Virtual Memory
The mPIPE supports virtual addressing on all structures with which software interacts. Each buffer stack is associated with an address space identifier (ASID). Each ASID provides a set of VA-to-PA translations through the I/O TLB.
Implementation Note: The TILE-Gx36 device supports 32 ASIDs and 16 TLB entries per ASID. Each
buffer stack is statically assigned to an ASID.
Buffer descriptors are associated with a single buffer stack and hence a specific virtual address
space. As buffer descriptors are processed by the mPIPE, the VA is converted to a PA by searching
for a valid translation within the set of TLB entries associated with the buffer stack’s ASID.
The buffer stacks themselves exist in physical memory with a set of attributes set up by configuration software. The application software’s only interaction with the stack is through the buffer
stack manager so no VA-to-PA translation is necessary – the PA space of the stack is essentially private to the stack engine.
Software provides buffer descriptors as part of the eDMA packet descriptor. In order to prevent
software-generated buffer descriptors from accessing unauthorized buffer stacks and the associated ASID, each eDMA descriptor ring provides a protection mask, which indicates which buffer
stack(s) the eDMA ring is allowed to access.
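The two checks described above, the ring protection mask and the per-ASID translation search, can be modeled in software as follows. This is a sketch under stated assumptions: the entry layout (base, size, physical base) is hypothetical and does not reflect the real I/O TLB entry format, and the 16-entry set size and 32-stack mask width are the TILE-Gx36 values quoted in this chapter.

```c
#include <stddef.h>
#include <stdint.h>

#define TLB_ENTRIES_PER_ASID 16   /* TILE-Gx36 value */
#define NUM_BUFFER_STACKS    32   /* TILE-Gx36 value */

/* Hypothetical software view of one translation; not the hardware format. */
struct io_tlb_entry {
    int      valid;
    uint64_t va_base;
    uint64_t size;
    uint64_t pa_base;
};

/* Nonzero if the ring's protection mask permits use of the buffer stack. */
int ring_may_use_stack(uint32_t protection_mask, unsigned stack)
{
    return stack < NUM_BUFFER_STACKS && ((protection_mask >> stack) & 1u);
}

/*
 * Search the entries belonging to the buffer stack's ASID for a valid
 * translation covering va. Returns 1 and writes *pa on a hit, 0 on a miss
 * (the case in which hardware would raise a TLB miss interrupt).
 */
int io_tlb_translate(const struct io_tlb_entry tlb[TLB_ENTRIES_PER_ASID],
                     uint64_t va, uint64_t *pa)
{
    for (size_t i = 0; i < TLB_ENTRIES_PER_ASID; i++) {
        const struct io_tlb_entry *e = &tlb[i];

        if (e->valid && va >= e->va_base && va - e->va_base < e->size) {
            *pa = e->pa_base + (va - e->va_base);
            return 1;
        }
    }
    return 0;
}
```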
Similarly, software configures each notification ring and eDMA descriptor ring with a set of physical memory attributes including start-PA, homing information, and caching hints. The ring must
reside in contiguous physical memory. eDMA rings are configured using the MPIPE_EDMA_DM_INIT_CTL/MPIPE_EDMA_DM_INIT_DAT registers.
The ring protection mask along with the eDMA ASIDs are configured using the MPIPE_EDMA_DM_INIT_CTL/MPIPE_EDMA_DM_INIT_DAT registers.
It is up to software to map the rings into virtual address space(s) for the worker Tiles. Communication between the worker and the mPIPE related to notification and eDMA rings uses head/tail
pointers relative to the start of the ring so no translation is necessary.
6.6.1
I/O TLB Details
System software is responsible for TLB fault handling. The following guidelines and properties
apply to mPIPE’s TLB structure:
•
eDMA and iDMA share a common I/O TLB. Each has a micro-TLB, which is not software visible.
•
The micro-TLBs can be flushed by writing a 1 to the MTLB_FLUSH bit of the MPIPE_TLB_CTL
register. This is used for shooting down TLB entries or ensuring that the micro-TLB is a subset
of the main TLB on replacement.
•
eDMA and iDMA each have their own interrupt binding (one binding each). The interrupts are
both in vector-0: IDMA_TLB_MISS and EDMA_TLB_MISS.
•
On a miss, the relevant information is placed into the MPIPE_TLB_IDMA_EXC and MPIPE_TLB_EDMA_EXC registers.
•
For iDMA, the DMA engine will stall on a fault unless the associated ASID is in flush-on-fault
mode as configured in the MPIPE_IDMA_ASID_FAULT_MODE register.
•
If ASIDs are configured for iDMA flush-on-fault, it is possible for multiple misses to be
reported. In this case, only the most recent fault will be captured in the MPIPE_TLB_IDMA_EXC
register.
•
For eDMA, the DMA descriptor that faults is retried. Other descriptor rings will continue to
make progress.
•
It is possible for multiple eDMA rings to report faults. Only the most recent fault will be
recorded in the EDMA_TLB_MISS interrupt.
•
System software must handle iDMA faults immediately if flush-on-fault mode is not being
used. Otherwise, packets will be lost as soon as the iPkt buffer overflows. Fault handling time
must be on the order of 1-5 microseconds.
•
eDMA faults will result in a minor loss of performance in the eDMA engine, as the faulting rings will
retry until the fault is handled. In most cases, this loss will not be perceptible unless many rings
are faulting while one ring is trying to achieve significant small-packet performance.
6.7 PA Distribution
The mPIPE uses the same algorithm as the Tile to assign a cacheline to a home Tile based on the
physical address. Each translation region, stack, and ring specifies if its associated physical memory is homed on a specific Tile or hashed-for-home across a set of Tiles.
Every mPIPE function that accesses Tile physical memory space contains an HFH table that translates a physical address into a home Tile. These tables are typically set up once at mPIPE
initialization time. The tables are accessed via the MPIPE_HFH_INIT_CTL/MPIPE_HFH_INIT_DAT registers.
6.7.1
Locality Hints
Each TLB entry has a NonTemporal hint used to indicate the locality properties of data contained
on the page. For packet data that is likely to be accessed within a relatively short period of time,
the NT hint should be set to 0. This will cause packet data writes to be marked as MRU (most
recently used) in the home Tile’s cache and hence be less likely to be displaced. For packet data
that is not likely to be accessed within a relatively short period of time, the NT hint bit should be
set to 1. This will cause the packet data to be marked as LRU (least recently used) and to be
more likely to be displaced, thus reducing the cache footprint of streaming packet data.
Table 6-13 summarizes the caching characteristics of packet read/write data based on the locality
hint and state of the cache.
Table 6-13. NonTemporal Hint Behavior
Operation  Non-Temporal Hint  Cache State  Behavior
Write      0                  Miss         Block is allocated in cache and marked as MRU.
Write      0                  Hit          Block is updated in cache and marked as MRU.
Write      1                  Miss         Block is written to main memory and NOT allocated in the cache.
Write      1                  Hit          Block is written to the cache but the LRU state is not updated.
Read       0                  Miss         Data is fetched from main memory but not allocated in the cache.
Read       0                  Hit          Data is fetched from the cache but the LRU state is not updated.
Read       1                  Miss         Data is fetched from main memory but not allocated in the cache.
Read       1                  Hit          Data is fetched from the cache and the cacheline is marked clean.
                                           NOTE: this results in the contents of the cacheline becoming
                                           unpredictable to Tile software so this must only be used if there
                                           are no Tile consumers of the packet data after egress.
For packet data writes, the NonTemporal hint comes from the associated TLB entry. However,
the bit setting of the MPIPE_IDMA_CTL register can be used to override the setting used for the
first N blocks of each packet.
For packet data reads, the NonTemporal hint can be configured on a per-ring basis to be sourced
from the TLB entry or fixed. This setting is in the MPIPE_EDMA_RG_INIT_DAT register for each
ring.
For the NotifRings and BufferStacks, the NonTemporal hint is programmable in the associated setup registers. For descriptor reads, the NonTemporal hint is always 0.
6.7.2
Pinning
For I/O writes, an optional pinning attribute can be assigned on a per-TLB entry or per structure
(NotifRing, buffer stack, etc.) basis. When asserted, the Tile’s I/O-pinned ways will be used for
write data on miss. This reduces the cache footprint of the associated I/O data for applications
that require explicit cache control.
I/O reads do not use the pinning attribute since misses at the home Tile never install in the cache.
6.8 MMIO
Communication with the mPIPE uses MMIO. The physical address is broken into fields that map
the address into the various mPIPE regions.
[Figure 6-19 field labels: OFFSET, REGION, Reserved, SVC_DOM, Reserved]
Figure 6-19: MMIO Physical Address
The offset field is interpreted as described in Table 6-14:
Table 6-14. MMIO Physical Address Bit Descriptions
Bits   Name     Type  Reset  Description
39:36  SVC_DOM  RW    0      This field of the address indexes the 16-entry service domain table.
28:26  REGION   RW    0      This field of the address selects the region (address space) to be accessed.
                             Value  Name  Meaning
                             0      CFG   Access to Configuration space. Protection level is
                                          provided by the service domain vector.
                             4      IDMA  Access to iDMA NotifRing and Bucket release.
                             5      EDMA  Access to eDMA descriptor rings.
                             6      BSM   Access to the buffer stack manager.
25:0   OFFSET   RW    0      This field of the address provides an offset into the region being accessed.
6.8.1
MAC Configuration Registers
Access to each MAC’s configuration space is via the MMIO configuration region for mPIPE. The
address for configuration space is described in Figure 6-20 and defined in Table 6-15.
Figure 6-20: Configuration Address Format
Table 6-15. Configuration Address Format Bit Descriptions
Bits   Name     Type  Reset  Description
21     INTFC    RW    0      Interface being accessed.
                             Value  Name   Meaning
                             0      mPIPE  Access to mPIPE registers
                             1      MAC    Access to MAC registers
20:16  MAC_SEL  RW    0      Selects the MAC being accessed when bit[21] is 1.
15:0   ADDR     RW    0      Register Address.
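The field positions in Tables 6-14 and 6-15 can be combined into an address with a few shifts and masks. The sketch below is illustrative only: the function and enum names are hypothetical, SVC_DOM is taken as bits 39:36 per Table 6-14, and the result covers only the fields described here. Any higher-order bits that select the mPIPE's MMIO aperture are outside the scope of this section and are not modeled.

```c
#include <stdint.h>

/* REGION encodings from Table 6-14. */
enum mpipe_region {
    MPIPE_REGION_CFG  = 0,
    MPIPE_REGION_IDMA = 4,
    MPIPE_REGION_EDMA = 5,
    MPIPE_REGION_BSM  = 6
};

/* Compose SVC_DOM [39:36], REGION [28:26], and OFFSET [25:0]. */
uint64_t mpipe_mmio_addr(unsigned svc_dom, unsigned region, uint32_t offset)
{
    return ((uint64_t)(svc_dom & 0xfu) << 36) |
           ((uint64_t)(region & 0x7u) << 26) |
           (uint64_t)(offset & 0x3ffffffu);
}

/* Configuration-region offset: INTFC [21], MAC_SEL [20:16], ADDR [15:0]. */
uint32_t mpipe_cfg_offset(unsigned intfc, unsigned mac_sel, unsigned addr)
{
    return ((uint32_t)(intfc & 0x1u) << 21) |
           ((uint32_t)(mac_sel & 0x1fu) << 16) |
           (uint32_t)(addr & 0xffffu);
}
```

For example, register 0x0000 of MAC 3 in service domain 0 would be addressed as mpipe_mmio_addr(0, MPIPE_REGION_CFG, mpipe_cfg_offset(1, 3, 0x0000)).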
6.8.2
Service Domains
The mPIPE provides 32 independent service domains. Each service domain allows or disallows
access to specific configuration domains, groups of buffer stacks, groups of NotifRings and
buckets, and eDMA rings.
Each service domain has an entry in the service domain table. A table entry consists of bits associated with services within the mPIPE. When a bit is set, access is allowed. When clear, MMIO
writes will be ignored and reads will return unpredictable data.
When an access is rejected due to a service domain check, an error response is returned to the
requesting Tile. This results in an asynchronous MMIO_ERROR interrupt at the requesting Tile.
Accesses that touch multiple service domains, such as the combined NotifRing/Bucket
release, will be completely rejected if any individual check fails.
The service domain table (Table 6-16) is accessed via the MPIPE_MMIO_INIT_CTL/MPIPE_MMIO_INIT_DAT registers. By default, all services are enabled for all service domains.
Table 6-16. MMIO Service Domain Table Entry
Bits     Description
31:0     Allow access to NotifRings for releases. The NotifRings are divided into 32 regions based on the MSBs of
         the NotifRing Index. Each bit corresponds to one of these regions. For example, if bit[6] is set, then
         access is allowed to any NotifRing with an index whose 5 MSBs are 00110(b).
         Implementation Note: On the TILE-Gx36, there are 256 NotifRings so each bit corresponds to the encoding
         of NotifRingIndex[7:3]. For example, NotifRings 0-7 are represented by bit[0], NotifRings 8-15 are
         represented by bit[1], etc.
63:32    Allow access to Buckets for releases. The upper 16 bits are used to provide access to the upper “non
         power of 2” buckets. The lower 16 bits provide access to the lower “power of 2” buckets. See the register
         spec for details.
         Implementation Note: On the TILE-Gx36, there are 4160 Buckets so each bit in the lower 16 bits
         corresponds to the encoding of BucketID[11:8] and is used when BucketID[12] is 0. When BucketID[12] is 1,
         the upper 16 bits of the vector are used and are indexed by BucketID[5:2].
95:64    Allow access to Buffer Stacks for buffer releases and fetches.
         Implementation Note: On the TILE-Gx36, there are 32 Buffer Stacks so each bit is associated with a
         single stack.
119:96   Allow access to eDMA Rings for descriptor posts and head pointer reads.
121:120  Configuration protection level. An access to a service domain set to Level-2 can access registers at 2 and
         below. Level-1 can only access level-1 and below. Level-0 can only access level-0 registers.
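To make the TILE-Gx36 implementation notes concrete, the sketch below computes which bit of a service domain table entry governs a given NotifRing, bucket, or buffer stack. The function names are hypothetical. It also assumes that the "upper 16 bits" of the bucket field are entry bits 63:48 and the "lower 16 bits" are entry bits 47:32, which follows from the 63:32 range but is not spelled out in the table.

```c
/* Entry bit (0..31) that governs a NotifRing: NotifRingIndex[7:3]. */
unsigned svc_bit_notif_ring(unsigned ring_index)
{
    return (ring_index >> 3) & 0x1fu;
}

/*
 * Entry bit (32..63) that governs a bucket.
 * BucketID[12] == 0: lower 16 bits, indexed by BucketID[11:8].
 * BucketID[12] == 1: upper 16 bits, indexed by BucketID[5:2].
 */
unsigned svc_bit_bucket(unsigned bucket_id)
{
    if (bucket_id & (1u << 12))
        return 48u + ((bucket_id >> 2) & 0xfu);
    return 32u + ((bucket_id >> 8) & 0xfu);
}

/* Entry bit (64..95) that governs a buffer stack: one bit per stack. */
unsigned svc_bit_buffer_stack(unsigned stack)
{
    return 64u + (stack & 0x1fu);
}
```

For instance, NotifRing 55 has NotifRingIndex[7:3] equal to 6, so it is covered by bit[6], matching the 00110(b) example in the table.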
6.9 Interrupts
Interrupts are sent to Tile software using the IPI mechanism. An interrupt consists of a target
TileID, InterruptNum, and EventNum. The TileID is the Tile receiving the interrupt, the
InterruptNum selects one of the four IPI interrupt levels at the Tile, and the EventNum is the
event number within the specified interrupt level.
The mPIPE provides a binding for each interrupt that it generates. A binding consists of the TileID, InterruptNum, and EventNum.
The mPIPE Interrupts are summarized in Table 6-17.
Table 6-17. mPIPE Interrupts
Int Name        Description                                                             Interrupt VecNum.BitNum
BSM_BAD_VA      Buffer stack manager received a buffer post with a bad VA               0.0
BSM_LIM_ERR     Attempt to post buffers beyond the size of the associated buffer stack  0.1
CLS_TINT        Classifier wrote the TileInt SPR                                        0.2
EDMA_POST_ERR   Post received but no valid descriptor                                   0.3
IDMA_CTR_OVFL   Counter Overflow                                                        1.[0-31]
EDMA_DESC_COMP  eDMA Descriptor complete                                                2.[0-23]
Each interrupt is assigned to a vector, which can be used to clear the interrupt status. Each vector
provides access via a Read/Write-one-to-clear address and a ReadToClear address. The vector
registers are MPIPE_INT_VECx_W1TC and MPIPE_INT_VECx_RTC (for example,
MPIPE_INT_VEC1_W1TC).
The interrupt binding registers are MPIPE_INT_BIND. Each binding contains the following fields:
•
Enable Bit. When 0, the interrupt will not be sent. The status bit will still be updated.
•
Mode Bit. When 1, interrupt will be dispatched each time it occurs. When set to 0, the interrupt
is only sent if the status bit is clear.
•
TileID(8)
•
IntNum(2)
•
EventNum(6)
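The Enable and Mode semantics listed above can be captured in a small software model. This is a sketch for illustration, not the register layout: the struct and function names are hypothetical, and the TileID, IntNum, and EventNum routing fields are omitted because they do not affect whether an interrupt is sent.

```c
/* Hypothetical software model of one interrupt binding. */
struct int_binding {
    int enable;   /* Enable bit */
    int mode;     /* Mode bit */
    int status;   /* the interrupt's status bit in its vector */
};

/*
 * Record one occurrence of the event. Returns 1 if an IPI would be sent
 * to the bound Tile, 0 otherwise. The status bit is always updated.
 */
int raise_event(struct int_binding *b)
{
    int was_set = b->status;

    b->status = 1;
    if (!b->enable)
        return 0;        /* disabled: status updated, nothing sent */
    if (b->mode)
        return 1;        /* mode 1: dispatch on every occurrence */
    return !was_set;     /* mode 0: dispatch only if status was clear */
}
```

In mode 0, software that clears the status bit through the W1TC or RTC address re-arms the interrupt for the next occurrence.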
6.10 UserIO
Bulk data transfer to and from the mPIPE uses the Tile memory system and hence provides direct
user access via virtual memory. Communication with the mPIPE is through the MMIO interface
(see Section 6.8 MMIO). System software controls access to the mPIPE via page table
mappings. Hence process-level protection and user access is provided as part of the virtual memory system.
Interrupts are delivered via the IPI and are configured through bindings on each structure that
generates interrupts such as eDMA rings, buffer stacks, and various other exceptions.
6.11 Flush Mechanisms
When an application crashes or needs to be restarted, related resources in the mPIPE need to be
flushed before they can be reallocated for another/new client. For example, there can be writes
inflight to the application’s NotifRing(s) or descriptors pending processing for the application’s
eDMA Ring(s). The mPIPE provides flush/drain mechanisms to aid system software in cleaning
up a particular flow without disturbing the performance of other flows.
6.11.1 MMIO Access Drain
An application can have MMIO loads and stores outstanding to the various mPIPE structures
including the buffer stack manager, load balancer, eDMA rings, or config space. An MF executed
on the application’s Tile(s) will guarantee that all MMIO transactions have completed and any
interrupts associated with those transactions will have been posted to the Tile.
6.11.2 NotifRing Drain
Before it can drain an application’s NotifRing(s), system software must first make sure no new
packets will target the NotifRing. The procedure for doing this depends on the system configuration and can require updating the classifier program, the load balancer configuration, or both.
The NotifRing’s COUNT field of the MPIPE_LBL_INIT_DAT_NR_TBL_1 register can also be written to 0xfffe to force all incoming packets that target the NotifRing to be dropped. Setting the
count must be done if the classifier program forces the NotifRing field in the descriptor since
the group and bucket settings have no effect in that case. This is useful if the NotifRing is going
to be immediately restarted after being drained.
Note that the load balancer must be temporarily frozen via the FREEZE bit of the MPIPE_LBL_CTL
register while making configuration changes to the load balancer.
Once the NotifRing has been configured to no longer receive packets, software must poll the
MPIPE_LBL_INIT_DAT_NR_INFL_CNT for the associated NotifRing until it reaches zero. This
ensures that there are no data, descriptor, or tail-pointer writes inflight to the associated
NotifRing.
Any remaining interrupts for the NotifRing can be cleared by reading the interrupt status vector
from the Tile bound to the associated interrupt. An MF after this read will ensure that all interrupts
for the NotifRing have been delivered to the Tile.
In summary, to drain a NotifRing software must:
1.
Freeze the load balancer.
2.
Remove NR from group and buckets. This includes zeroing the bucket count on any DFA
buckets and reassigning the bucket on any FIXED or STICKY mode buckets.
3.
Optionally set NotifRing’s COUNT to 0xfffe (this is done if packets are expected to still be
getting assigned to the NotifRing’s group).
4.
Un-Freeze load balancer.
5.
If the classifier needs to be reprogrammed:
a. Reprogram the classifier and poll the PGM_PND bit of the MPIPE_CLS_ENABLE register until
clear.
b. Set the CLS_FENCE bit in MPIPE_CLS_CTL register and poll until clear to be sure all packets
using the old program have been delivered to the load balancer.
c. At this point, no new packets will be targeting the NotifRing.
6.
Poll MPIPE_LBL_INIT_DAT_INFL_CNT until it reads zero.
7.
Read Interrupt Status register (INT_VEC*_RTC, for example the MPIPE_INT_VEC0_RTC register) from the bound Tile.
8.
MF.
6.11.3 Ingress Channel Drain
If a link goes down, the MAC will automatically terminate any inflight packets with a MACERROR status. Software can ensure that all packets for the MAC and associated channel(s) have
drained by polling for the per-channel iPKT counts to reach zero. The per-channel counters are
accessed by setting the DIAG_CTR_SEL bit in the MPIPE_IDMA_CTL register to CHANNEL and
then reading the DIAG_CTR_VAL bit of the MPIPE_IDMA_STS register.
6.11.4 EDMA Ring Drain
Before it can drain an eDMA ring, software first freezes the ring to prevent new descriptors from
being fetched. It then must set the FLUSH bit for the ring to drain any data from the ePkt buffer.
Once the buffer is completely flushed, the ring state must be re-initialized.
Flushing the ring can result in corrupted packets on the MAC interface since partially buffered or
even partially sent packets might need to be truncated. The MAC will insert bad CRC into flushed
packets to ensure the packet is discarded at the receiving node.
The steps required are shown below:
1.
Set the FLUSH and FREEZE bits in the ring’s MPIPE_EDMA_DM_INIT_DAT_SETUP register.
This will prevent new descriptors from being fetched and prevent buffered-descriptors from
being processed.
2.
Issue a memory fence.
3.
Poll the FLUSH_PND bit in the MPIPE_EDMA_CTL register until it is clear. This clears out existing packet data for the ring and allows pending requests to complete.
4.
Issue a memory fence.
5.
Set the FENCE bit in the MPIPE_EDMA_CTL register.
6.
Issue a memory fence.
7.
Poll the FENCE bit in the MPIPE_EDMA_CTL register until it is clear. This clears any remaining
requests.
8.
Poll the FLUSH_PND bit of the MPIPE_EDMA_CTL register until clear. This clears remaining
data from any drained requests.
At this point, the ring has been flushed.
9.
Get the descriptor complete count from COUNT bits of the MPIPE_EDMA_POST_REGION_VAL
register. This is the rolling count used for this DMA ring that indicates how many descriptors
were processed (and if it had any associated buffers returned).
10. Set the ring’s head pointer to zero via the HEAD bits of the MPIPE_EDMA_DM_INIT_DAT_HEAD
register.
11. Set MPIPE_EDMA_DM_INIT_DAT_DESC_STATE0 to 1 and MPIPE_EDMA_DM_INIT_DAT_DESC_STATE1 to 0 to return the descriptor fetch to its initial state.
12. Flush any outstanding interrupts by reading the associated interrupt vectors.
13. Issue a memory fence.
At this point, the ring can be reused.
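The ordering of the steps above can be sketched in C against an in-memory stand-in for the ring's registers. Everything in this sketch is an assumption made for illustration: struct fake_ring is not the real register layout, the poll helpers simply count down to imitate hardware clearing a bit, and memory_fence only counts calls where real code would execute an MF. Interrupt vector reads (step 12) are noted but not modeled.

```c
#include <stdint.h>

/* Hypothetical in-memory stand-in for one eDMA ring's state. */
struct fake_ring {
    int      freeze, flush;  /* FREEZE and FLUSH bits of the SETUP register */
    int      flush_pnd;      /* polls left until FLUSH_PND reads clear */
    int      fence;          /* polls left until FENCE reads clear */
    uint32_t count;          /* descriptor complete COUNT */
    uint32_t head;           /* ring head pointer */
    uint32_t desc_state0, desc_state1;
    unsigned fences;         /* number of memory fences issued */
};

static void memory_fence(struct fake_ring *r) { r->fences++; }

/* Bounded poll: 0 once the bit reads clear, -1 on timeout. */
static int poll_until_clear(int *pending)
{
    for (int i = 0; i < 1000; i++) {
        if (*pending == 0)
            return 0;
        (*pending)--;        /* fake hardware making progress */
    }
    return -1;
}

/* Returns 0 on success and stores the descriptor complete count. */
int edma_ring_drain(struct fake_ring *r, uint32_t *completed)
{
    r->flush = 1;                                        /* step 1 */
    r->freeze = 1;
    memory_fence(r);                                     /* step 2 */
    if (poll_until_clear(&r->flush_pnd)) return -1;      /* step 3 */
    memory_fence(r);                                     /* step 4 */
    r->fence = 1;                                        /* step 5 */
    memory_fence(r);                                     /* step 6 */
    if (poll_until_clear(&r->fence)) return -1;          /* step 7 */
    if (poll_until_clear(&r->flush_pnd)) return -1;      /* step 8 */
    *completed = r->count;                               /* step 9 */
    r->head = 0;                                         /* step 10 */
    r->desc_state0 = 1;                                  /* step 11 */
    r->desc_state1 = 0;
    /* step 12: read the ring's interrupt vectors (not modeled) */
    memory_fence(r);                                     /* step 13 */
    return 0;
}
```

The text does not say when FREEZE and FLUSH are cleared again, so the sketch leaves them set and expects ring re-initialization to handle that.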
The procedure described above will flush a ring regardless of any backpressure being received
from the MAC, corrupted descriptors, or descriptors with TLB misses. To reduce or
eliminate corrupted (truncated) packets on the MAC, software must first allow normally flowing packets and descriptors to complete.
This is done by setting the ring’s FREEZE bit but not its FLUSH bit. Then the descriptor-complete
count can be compared to the head pointer to see when all of the ring’s outstanding descriptors
have been completed. Finally, the associated ring’s ePKT block counter can be read to see that it is
empty. This provides a cleaner shutdown for rings that are otherwise behaving properly.
CHAPTER 7
XAUI MAC INTERFACE
7.1 Introduction
The TILE-Gx™ XAUI MAC and PCS provide a 10Gb/s Ethernet interface to mPIPE™. Configuration of the MAC is through the mPIPE MMIO space.
7.1.1
Features
•
Compatible with IEEE Standard 802.3
•
May be configured as four SGMII ports or single XAUI port
•
Optional double-rate XAUI mode for 20Gbps operation using four lanes at 6.25Gbps (see data
sheet for specific device support)
•
Custom modes for lower overhead and higher throughput
•
Supports 802.1Qbb priority-based flow control
•
High precision timestamping and IEEE 1588
•
MDIO and Interrupt interfaces to off-chip PHYs
•
Configurable CRC
•
Multiple loopback modes and pattern generators for in system test and characterization
•
Independent polarity reversal on TX and RX
•
Support for 802.3az Energy Efficient Ethernet
7.2 Register Spaces
Access to the XAUI MAC is via the mPIPE’s MAC interface in MMIO space. The format for the
physical address is shown in Figure 7-1.
[Figure 7-1 field labels: REG, MAC_SEL, INTFC, Reserved, REGION, Reserved, SVC_DOM, Reserved]
Figure 7-1: Physical Address Format
Table 7-1. Physical Address Format Bit Descriptions
Bits   Name     Type  Reset  Description
39:35  SVC_DOM  RW    0      This field of the address indexes the 32 entry service domain table.
28:26  REGION   RW    0      This field of the address selects the region (address space) to be
                             accessed. For the config region, this field must be 0.
21     INTFC    RW    0      Interface being accessed.
                             Value  Name   Meaning
                             0      MPIPE  Access to MPIPE registers
                             1      MAC    Access to MAC registers
20:16  MAC_SEL  RW    0      Selects the MAC being accessed when bit[21] is 1.
15:0   REG      RW    0      Register address.
Registers in the XAUI MAC are all 8-bytes. Accesses smaller than 8-bytes are not supported and
will result in an MMIO error returned to the requesting Tile.
7.3 MAC and Channel Mapping
MACs are assigned MAC-Numbers in hardware. This number is used in the MAC_SEL field of the
MMIO address when accessing MAC registers. System software may perform MAC “discovery”
by reading the MPIPE_XAUI_MAC_INFO register that is always at address 0x0000 in each MAC’s
address space.
Each XAUI MAC is also assigned to a specific hardware channel in mPIPE. This mapping is provided in the CHANNEL bit of the MPIPE_XAUI_MAC_INFO register and is also described by the
mPIPE’s MPIPE_MACn_MAP registers.
The channel numbers are assigned to eDMA rings by system software. See “Ring to Channel Mapping” on page 127.
7.4 Port Configuration
The XAUI port is enabled via the MPIPE_MAC_ENABLE register. Basic MAC settings are configured in the MPIPE_XAUI_TRANSMIT_CONTROL and MPIPE_XAUI_TRANSMIT_CONFIGURATION
registers and MPIPE_XAUI_RECEIVE_CONTROL and MPIPE_XAUI_TRANSMIT_CONFIGURATION
registers.
Once the port is enabled, the link status can be monitored through the MPIPE_XAUI_PCS_STS
register.
A port that is disabled will automatically turn off the SERDES and operate in a low power mode.
MAC registers, interrupts, and MDIO functions are still accessible when a port is disabled.
For XAUI ports that support double-rate mode (see TILE-Gx36 Data Sheet (DS400)), the DOUBLE_RATE bit of the MPIPE_XAUI_PCS_CTL register must be written prior to enabling the port
via the MPIPE_MAC_ENABLE.
7.4.1
Lane Sharing with SGMII
The XAUI port may be reconfigured into four independent SGMII ports. This is controlled by the
MPIPE_MAC_ENABLE register. When one or more lanes are operating in SGMII mode, the XAUI
MAC is no longer used and instead the SGMII MACs control the lane. See Chapter 8: SGMII
MAC Interface.
The AVAIL bits of the MPIPE_XAUI_MAC_INFO and MPIPE_MAC_ENABLE registers indicate whether the port can be used in the device’s configuration.
7.5 Flow Control
The XAUI MAC supports standard 802.3 pause-based flow control as well as 802.1Qbb priority-based flow control. The MAC can auto-generate either pause type based on configurable high
water marks in the mPIPE iPKT buffer. The reception of pause frames triggers a pause condition
on one or more TX queues. These queues can be mapped into the mPIPE eDMA arbiter to block
traffic from one or more rings that target the MAC.
mPIPE supports up to 16 priority queues in the iPKT buffer. Each has a programmable high water
mark. The MPIPE_PR_PAUSE_THR registers contain the programmable high water marks. The
lower 8 priority queues can be mapped directly to the eight 802.1Qbb PFC queues. The upper 8
priority queues are used for 802.3 pause and/or mPIPE loopback channels.
The PAUSE_MODE bit of the MPIPE_XAUI_MAC_INTFC_CTL register controls the type of pause
frame to be sent when priority queues become full.
7.5.1
Priority-Based Flow Control
Incoming RX packets are assigned to a priority queue based on the data extracted from the VLAN
priority tag as per IEEE 802.1Qbb. The RX queue selection can be overridden via the PRQ_OVD
and PRQ_OVD_VAL bits of the MPIPE_XAUI_MAC_INTFC_CTL register.
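As a rough illustration of the RX queue selection, the sketch below extracts the 3-bit Priority Code Point from an 802.1Q tag (TPID 0x8100 at byte offset 12, PCP in the top three bits of the following TCI) and applies the override. The function name, the direct use of the PCP value as the queue number, and the choice of queue 0 for untagged frames are assumptions; this section does not define them.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Select an RX priority queue for a frame.
 * prq_ovd and prq_ovd_val model the PRQ_OVD and PRQ_OVD_VAL settings.
 */
int rx_priority_queue(const uint8_t *frame, size_t len,
                      int prq_ovd, unsigned prq_ovd_val)
{
    if (prq_ovd)
        return (int)prq_ovd_val;      /* override wins */

    /* 802.1Q tag: TPID 0x8100 at bytes 12-13, TCI at bytes 14-15. */
    if (len >= 16 && frame[12] == 0x81 && frame[13] == 0x00)
        return frame[14] >> 5;        /* PCP is TCI bits 15:13 */

    return 0;                         /* untagged: queue 0 assumed */
}
```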
As RX packets are written into their associated queues, a counter tracks the queue’s fullness. Once
the high water mark is reached, the MAC can automatically dispatch priority pause frames indicating back pressure on one or more queues. The MAC can be configured to monitor any number
of queues via the TX_PRQ_ENA bit of the MPIPE_XAUI_MAC_INTFC_CTL register.
The MPIPE_XAUI_MAC_INTFC_TX_CTL register contains settings to override the mPIPE eDMA
back pressure so that one or more queues can be manually paused or unpaused.
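The high water mark behavior described above can be modeled in a few lines. The sketch below is a software model only, not driver code: the class and function names are hypothetical, the fill units are arbitrary, and it assumes a queue counts as "over" once its fill level is greater than or equal to its high water mark.

```python
class IpktPriorityQueues:
    """Toy model of the 16 iPKT priority queues and their high water marks."""

    NUM_QUEUES = 16

    def __init__(self, high_water_marks):
        if len(high_water_marks) != self.NUM_QUEUES:
            raise ValueError("expected one high water mark per priority queue")
        self.high_water_marks = list(high_water_marks)
        self.fill = [0] * self.NUM_QUEUES

    def write(self, queue, amount):
        # An RX packet is written into its associated queue.
        self.fill[queue] += amount

    def drain(self, queue, amount):
        # Data leaves the queue; the fill level never goes below zero.
        self.fill[queue] = max(0, self.fill[queue] - amount)

    def over_threshold_mask(self):
        # Bit q is set when queue q has reached its high water mark.
        mask = 0
        for q in range(self.NUM_QUEUES):
            if self.fill[q] >= self.high_water_marks[q]:
                mask |= 1 << q
        return mask


def pfc_priorities(mask):
    # Only the lower 8 queues map onto the eight 802.1Qbb PFC priorities.
    return [p for p in range(8) if (mask >> p) & 1]
```

In this model, the lower eight bits of the mask correspond to the priorities that a priority pause frame would report, while the upper eight bits belong to the queues used for 802.3 pause and loopback channels.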
7.6 Interrupts
Each XAUI port has a dedicated interrupt input pin. This interrupt pin is intended for connection
to an external PHY. When this interrupt input is high, an interrupt is signaled to the PHY_INT bit
of the MPIPE_XAUI_INTERRUPT_STATUS register. This interrupt should typically be operated in
mode-0 since the interrupt is level-sensitive.
In addition to the external (PHY) interrupt, the XAUI MAC produces a number of interrupt conditions as described in the MPIPE_XAUI_INTERRUPT_STATUS register description.
7.7 Timestamping and IEEE 1588
The XAUI MAC supports IEEE 1588 frame recognition and generation for system wide time
correlation.
IEEE 1588 is a standard for precision time synchronization in local area networks. It works with
the exchange of special Precision Time Protocol (PTP) frames. The PTP messages can be transported over IEEE 802.3/Ethernet, over Internet Protocol Version 4 or over Internet Protocol
Version 6 as described in the annex of IEEE P1588.D2.1. Most 1588 functionality can be implemented in software but for greatest accuracy hardware assist is required to detect when PTP event
messages pass the GMII interface (clock time-stamp point).
The MAC detects when the PTP event messages (sync, delay_req, pdelay_req and pdelay_resp) are
transmitted and received. The MPIPE_XAUI_TX_1588/MPIPE_XAUI_RX_1588 registers indicate
the message time-stamp point of PTP event frames. These timestamp registers can be correlated
back to the MPIPE_TIMESTAMP_VAL if desired.
Synchronization between master and slave clocks is a two stage process. First the offset between
the master and slave clocks is corrected by the master sending a sync frame to the slave with a follow up frame containing the exact time the sync frame was sent.
Hardware assist modules at the master and slave side detect exactly when the sync frame was sent
by the master and received by the slave. The slave then corrects its clock to match the master
clock. Second the transmission delay between the master and slave is corrected. The slave sends a
delay request frame to the master which sends a delay response frame in reply. Hardware assist
modules at the master and slave side detect exactly when the delay request frame was sent by the
slave and received by the master. The slave will now have enough information to adjust its clock
to account for delay.
For example if the slave was assuming zero delay the actual delay will be half the difference
between the transmit and receive time of the delay request frame (assuming equal transmit and
receive times) because the slave clock will be lagging the master clock by the delay time already.
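The two-stage correction can be checked with a small numerical model. The sketch below is illustrative only: the function name, the nanosecond units, and the fixed send times are assumptions, and it assumes a symmetric link delay as the text does. The master clock is treated as true time.

```python
def simulate_two_stage_sync(true_offset_ns, delay_ns):
    """Return (measured_delay_ns, final_slave_error_ns) for a symmetric link."""
    # Stage 1: offset correction using the sync and follow up frames.
    t1 = 1_000_000                           # master clock when sync is sent
    t2 = t1 + delay_ns + true_offset_ns      # slave clock when sync is received
    correction = t2 - t1                     # slave assumes zero delay
    residual = true_offset_ns - correction   # equals -delay_ns: slave now lags

    # Stage 2: delay correction using the delay request and response frames.
    send_instant = 2_000_000                 # master-clock instant of the send
    t3 = send_instant + residual             # same instant on the slave clock
    t4 = send_instant + delay_ns             # master clock when it is received
    measured_delay = (t4 - t3) // 2          # half the tx/rx time difference

    final_error = residual + measured_delay  # slave advances by the delay
    return measured_delay, final_error
```

With a 5000 ns initial offset and a 300 ns link delay, the model measures a delay of 300 ns and leaves the slave with zero residual error, which matches the "half the difference" rule stated above.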
For hardware assist it is necessary to time-stamp when sync and delay_req messages are sent and received. The time-stamp is taken when the message time-stamp point passes the clock time-stamp point. For Ethernet the message time-stamp point is the SFD and the clock time-stamp point is the MII interface. (The 1588 spec refers to sync and delay_req messages as event messages as these require time-stamping. Follow up, delay response and management messages do not require time-stamping and are referred to as general messages.)
1588 version 2 defines two additional PTP event messages. These are the peer delay request (Pdelay_Req) and peer delay response (Pdelay_Resp) messages. These messages are used to calculate the delay on a link. Nodes at both ends of a link send both types of frames (regardless of whether they contain a master or slave clock). The Pdelay_Resp message contains the time at which a Pdelay_Req was received and is itself an event message. The time at which a Pdelay_Resp message is received is returned in a Pdelay_Resp_Follow_Up message.
1588 version 2 introduces transparent clocks, of which there are two kinds: peer-to-peer (P2P) and end-to-end (E2E). Transparent clocks measure the transit time of event messages through a bridge and amend a correction field within the message to allow for the transit time. P2P transparent clocks additionally correct for the delay in the receive path of the link using the information gathered from the peer delay frames. With P2P transparent clocks, delay_req messages are not used to measure link delay. This simplifies the protocol and makes larger systems more stable.
The sof_tx and sof_rx signals are provided to indicate the message time-stamp point, and follow up signals are provided to indicate the presence of an event frame. With 1588 version 1, for a given data rate the assertion of the event frame signals will be a fixed delay after the sof signals, so taking the time-stamp could be delayed until the event signals are asserted and suitable compensation made.
The XGM recognizes seven different encapsulations for PTP event messages:
1. 1588 version 1 (UDP/IPv4 multicast)
2. 1588 version 2 (UDP/IPv4 multicast)
3. 1588 version 2 (UDP/IPv6 multicast)
4. 1588 version 2 (Ethernet multicast)
5. 1588 version 1 (UDP/IPv4/VLAN multicast)
6. 1588 version 2 (UDP/IPv4/VLAN multicast)
7. 1588 version 2 (UDP/IPv6/VLAN multicast)
Example of a sync frame in the 1588 version 1 (UDP/IPv4) format:
Preamble/SFD 55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13) 0800
IP stuff (Octets 14-22)
UDP (Octet 23) 11
IP stuff (Octets 24-29)
IP DA (Octets 30-32) E00001
IP DA (Octet 33) 81 or 82 or 83 or 84
source IP port (Octets 34-35)
dest IP port (Octets 36-37) 013F
other stuff (Octets 38-42)
versionPTP (Octet 43) 01
other stuff (Octets 44-73)
control (Octet 74) 00
other stuff (Octets 75-168)
Example of a delay request frame in the 1588 version 1 (UDP/IPv4) format:
Preamble/SFD 55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13) 0800
IP stuff (Octets 14-22)
UDP (Octet 23) 11
IP stuff (Octets 24-29)
IP DA (Octets 30-32) E00001
IP DA (Octet 33) 81 or 82 or 83 or 84
source IP port (Octets 34-35)
dest IP port (Octets 36-37) 013F
other stuff (Octets 38-42)
versionPTP (Octet 43) 01
other stuff (Octets 44-73)
control (Octet 74) 01
other stuff (Octets 75-168)
For 1588 version 1 messages, the XGM indicates sync and delay request frames
if the frame's type field indicates TCP/IP, the UDP protocol is indicated, the
destination IP address is 224.0.1.129, 130, 131 or 132, the destination UDP
port is 319 and the control field is correct.
The control field is 0x00 for sync frames and 0x01 for delay request frames.
For 1588 version 2 messages the type of frame is determined by looking at the message type field in the first byte of the PTP frame. Whether a frame is version 1
or version 2 can be determined by looking at the version PTP field in the second
byte of both version 1 and version 2 PTP frames.
In version 2 messages sync frames have a message type value of 0x0, delay_req
have 0x1, pdelay_req have 0x2 and pdelay_resp have 0x3.
Example of a sync frame in the 1588 version 2 (UDP/IPv4) format:
Preamble/SFD 55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13) 0800
IP stuff (Octets 14-22)
UDP (Octet 23) 11
IP stuff (Octets 24-29)
IP DA (Octets 30-33) E0000181
source IP port (Octets 34-35)
dest IP port (Octets 36-37) 013F
other stuff (Octets 38-41)
messagetype (Octet 42) 00
version PTP (Octet 43) 02
Example of a pdelay_req frame in the 1588 version 2 (UDP/IPv4) format:
Preamble/SFD 55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13) 0800
IP stuff (Octets 14-22)
UDP (Octet 23) 11
IP stuff (Octets 24-29)
IP DA (Octets 30-33) E000006B
source IP port (Octets 34-35)
dest IP port (Octets 36-37) 013F
other stuff (Octets 38-41)
messagetype (Octet 42) 02
version PTP (Octet 43) 02
Example of a sync frame in the 1588 version 2 (UDP/IPv6) format:
Preamble/SFD 55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13) 86dd
IP stuff (Octets 14-19)
UDP (Octet 20) 11
IP stuff (Octets 21-37)
IP DA (Octets 38-53) FF0X000000000181
source IP port (Octets 54-55)
dest IP port (Octets 56-57) 013F
other stuff (Octets 58-61)
messagetype (Octet 62) 00
other stuff (Octets 63-93)
version PTP (Octet 94) 02
Example of a pdelay_resp frame in the 1588 version 2 (UDP/IPv6) format:
Preamble/SFD 55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13) 86dd
IP stuff (Octets 14-19)
UDP (Octet 20) 11
IP stuff (Octets 21-37)
IP DA (Octets 38-53) FF0200000000006B
source IP port (Octets 54-55)
dest IP port (Octets 56-57) 013F
other stuff (Octets 58-61)
messagetype (Octet 62) 03
other stuff (Octets 63-93)
version PTP (Octet 94) 02
Example of a sync frame in the 1588 version 2 (Ethernet multicast) format. For
the multicast address 011B19000000, sync and delay request frames are recognized
depending on the messagetype field, 00 for sync and 01 for delay request:
Preamble/SFD 55555555555555D5
DA (Octets 0 - 5) 011B19000000
SA (Octets 6 - 11)
Type (Octets 12-13) 88F7
messagetype (Octet 14) 00
version PTP (Octet 15) 02
Example of a pdelay_req frame in the 1588 version 2 (Ethernet multicast) format.
These frames need a special multicast address so they can get through ports
blocked by the spanning tree protocol. For the multicast address 0180C200000E,
sync, pdelay request and pdelay response frames are recognized depending on the
messagetype field: 00 for sync, 02 for pdelay request and 03 for pdelay response.
Preamble/SFD 55555555555555D5
DA (Octets 0 - 5) 0180C200000E
SA (Octets 6 - 11)
Type (Octets 12-13) 88F7
messagetype (Octet 14) 02
version PTP (Octet 15) 02
Also PTP frames encapsulated in UDP/IPv4 or IPv6 and VLAN are supported.
VLAN frames are indicated by 8100 for type field and the next type has to be IPv4
or IPv6.
Example of a sync frame in the 1588 version 1 (UDP/IPv4/VLAN) format:
Preamble/SFD 55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13) 8100
VLAN tag (Octets 14-15)
Type (Octets 16-17) 0800
IP stuff (Octets 18-26)
UDP (Octet 27) 11
IP stuff (Octets 28-33)
IP DA (Octets 34-36) E00001
IP DA (Octet 37) 81 or 82 or 83 or 84
source IP port (Octets 38-39)
dest IP port (Octets 40-41) 013F
other stuff (Octets 42-46)
versionPTP (Octet 47) 01
other stuff (Octets 48-77)
control (Octet 78) 00
other stuff (Octets 79-168)
Example of a delay request frame in the 1588 version 1 (UDP/IPv4/VLAN) format:
Preamble/SFD 55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13) 8100
VLAN tag (Octets 14-15)
Type (Octets 16-17) 0800
IP stuff (Octets 18-26)
UDP (Octet 27) 11
IP stuff (Octets 28-33)
IP DA (Octets 34-36) E00001
IP DA (Octet 37) 81 or 82 or 83 or 84
source IP port (Octets 38-39)
dest IP port (Octets 40-41) 013F
other stuff (Octets 42-46)
versionPTP (Octet 47) 01
other stuff (Octets 48-77)
control (Octet 78) 01
other stuff (Octets 79-168)
Example of a sync frame in the 1588 version 2 (UDP/IPv4/VLAN) format:
Preamble/SFD 55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13) 8100
VLAN tag (Octets 14-15)
Type (Octets 16-17) 0800
IP stuff (Octets 18-26)
UDP (Octet 27) 11
IP stuff (Octets 28-33)
IP DA (Octets 34-37) E0000181
source IP port (Octets 38-39)
dest IP port (Octets 40-41) 013F
other stuff (Octets 42-45)
messagetype (Octet 46) 00
version PTP (Octet 47) 02
Example of a pdelay_req frame in the 1588 version 2 (UDP/IPv4/VLAN) format:
Preamble/SFD 55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13) 8100
VLAN tag (Octets 14-15)
Type (Octets 16-17) 0800
IP stuff (Octets 18-26)
UDP (Octet 27) 11
IP stuff (Octets 28-33)
IP DA (Octets 34-37) E000006B
source IP port (Octets 38-39)
dest IP port (Octets 40-41) 013F
other stuff (Octets 42-45)
messagetype (Octet 46) 02
version PTP (Octet 47) 02
Example of a sync frame in the 1588 version 2 (UDP/IPv6/VLAN) format:
Preamble/SFD 55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13) 8100
VLAN tag (Octets 14-15)
Type (Octets 16-17) 86dd
IP stuff (Octets 18-23)
UDP (Octet 24) 11
IP stuff (Octets 25-41)
IP DA (Octets 42-57) FF0X000000000181
source IP port (Octets 58-59)
dest IP port (Octets 60-61) 013F
other stuff (Octets 62-65)
messagetype (Octet 66) 00
other stuff (Octets 67-97)
version PTP (Octet 98) 02
Example of a pdelay_resp frame in the 1588 version 2 (UDP/IPv6/VLAN) format:
Preamble/SFD 55555555555555D5
DA (Octets 0 - 5)
SA (Octets 6 - 11)
Type (Octets 12-13) 8100
VLAN tag (Octets 14-15)
Type (Octets 16-17) 86dd
IP stuff (Octets 18-23)
UDP (Octet 24) 11
IP stuff (Octets 25-41)
IP DA (Octets 42-57) FF0200000000006B
source IP port (Octets 58-59)
dest IP port (Octets 60-61) 013F
other stuff (Octets 62-65)
messagetype (Octet 66) 03
other stuff (Octets 67-97)
version PTP (Octet 98) 02
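The recognition rules and octet offsets listed in the frame examples above can be expressed as a small software classifier. This is a sketch for illustration, not a description of the MAC hardware: the function and table names are hypothetical, the frame is assumed to start at DA octet 0 (preamble and SFD already removed), and the IPv6 encapsulations are left out for brevity. It also assumes that the message type and version occupy the low four bits of their octets in version 2 frames, and that any version 2 event type is acceptable on either IPv4 multicast address.

```python
V1_EVENTS = {0x00: "sync", 0x01: "delay_req"}            # control field values
V2_EVENTS = {0x0: "sync", 0x1: "delay_req",
             0x2: "pdelay_req", 0x3: "pdelay_resp"}      # messagetype values
V1_IPV4_DAS = {bytes.fromhex("E00001%02X" % n) for n in (0x81, 0x82, 0x83, 0x84)}
V2_IPV4_DAS = {bytes.fromhex("E0000181"), bytes.fromhex("E000006B")}
ETH_DAS = {  # Ethernet multicast address -> message types recognized on it
    bytes.fromhex("011B19000000"): {0x0, 0x1},
    bytes.fromhex("0180C200000E"): {0x0, 0x2, 0x3},
}


def classify_ptp_event(frame):
    """Return (ptp_version, message_name) for a PTP event frame, else None."""
    frame = bytes(frame)
    if len(frame) < 16:
        return None
    ethertype = frame[12:14]
    shift = 0
    if ethertype == b"\x81\x00":
        # VLAN tag present: every later field moves by four octets, and the
        # next type has to be IP (only IPv4 is handled in this sketch).
        shift = 4
        ethertype = frame[16:18]
        if ethertype == b"\x88\xf7":
            return None
    if ethertype == b"\x88\xf7":
        # 1588 version 2 over Ethernet multicast.
        allowed = ETH_DAS.get(frame[0:6])
        msg_type = frame[14] & 0x0F
        if allowed is None or msg_type not in allowed or (frame[15] & 0x0F) != 2:
            return None
        return (2, V2_EVENTS[msg_type])
    if ethertype != b"\x08\x00" or len(frame) < 44 + shift:
        return None
    # UDP/IPv4: protocol octet must be 0x11 and the UDP destination port 319.
    if frame[23 + shift] != 0x11 or frame[36 + shift:38 + shift] != b"\x01\x3f":
        return None
    ip_da = frame[30 + shift:34 + shift]
    version = frame[43 + shift] & 0x0F
    name = None
    if version == 1 and ip_da in V1_IPV4_DAS and len(frame) > 74 + shift:
        name = V1_EVENTS.get(frame[74 + shift])
    elif version == 2 and ip_da in V2_IPV4_DAS:
        name = V2_EVENTS.get(frame[42 + shift] & 0x0F)
    return (version, name) if name else None
```

The offsets assume an IPv4 header without options, as in the examples; a frame with IP options would place the UDP and PTP fields elsewhere.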
7.8 MDIO
An MDIO interface is provided to allow configuration of an off-chip PHY. Multiple XAUI MACs
may share a single MDIO port depending on the device configuration (see datasheet). When more
than one port shares the MDIO, access is coordinated in software and enabled via the
MPIPE_MAC_MANAGE register.
The MAC’s MPIPE_XAUI_MDIO_CONTROL register is used to operate the MDIO interface.
7.9 Statistics
The XAUI MAC provides statistics counters for transmitted and received frames including byte
counts, frame counts, specific frame sizes, and specific frame types. These registers are in the
MAC starting with the MPIPE_XAUI_TRANSMITTED_OCTETS_LO register.
7.10 Filtering
Incoming packets may be checked for an exact match against up to 8 MAC addresses in the
MPIPE_XAUI_EXACT_MATCH registers as well as a hashed match against the MPIPE_XAUI_RX_HASH registers (including the MPIPE_XAUI_RX_HASH_BOTTOM and MPIPE_XAUI_RX_HASH_TOP
registers). They can also be checked against a set of type-match registers.
The address checking, or filter, block determines which receive frames should be sent to mPIPE.
Whether a frame is sent or dropped depends on what is enabled in the receive configuration register, the contents of the specific address, type match, and hash registers, and the frame’s destination
address and type/length field.
The exact address match address is split between two registers, high and low. To enable or disable
exact address matching, write to the exact address registers: the low register to disable, the high
register to enable. The first six bytes (48 bits) of an Ethernet frame make up the destination
address. The first bit of the destination address (i.e. the LSB of the first byte of the frame) is the
group/individual bit. This is one for multicast addresses and zero for unicast. The all ones
address is the broadcast address and a special case of multicast.
The MAC supports recognition of eight specific addresses. Each specific address requires two registers, specific address register bottom and specific address register top. Specific address register
bottom stores the first four bytes of the destination address and specific address register top contains the last two bytes. The addresses stored can be specific, group, local or universal. See IEEE
Standard 802-2001, Clause 9 for a detailed description of 802 addressing.
7.10.1 Type ID Checking
The contents of the four type match registers are compared to the length/type ID in bytes 13 and
14 of received frames. If there is a match, the frames are copied into the RX FIFO. The following
example illustrates the use of the address and type ID match registers for a MAC address of
21:43:65:87:A9:CB.
Preamble 55
SFD D5
DA (Octet0 - LSB) 21
DA(Octet 1) 43
DA(Octet 2) 65
DA(Octet 3) 87
DA(Octet 4) A9
DA (Octet5 - MSB) CB
SA (LSB) 00
SA 00
SA 00
SA 00
SA 00
SA (MSB) 00
Type ID 43
Type ID 21
The sequence above shows the beginning of an Ethernet frame. Byte order of transmission is from
top to bottom as shown. For a successful match to specific address 1, the following address matching registers must be set up:
MPIPE_XAUI_EXACT_MATCH_BOTTOM_0 = 0x87654321
MPIPE_XAUI_EXACT_MATCH_TOP_0 = 0x0000CBA9
And for a successful match to type ID, the following type ID match register must be set up:
MPIPE_XAUI_TYPE_MATCH0 = 0x80004321
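The register values in the example above follow directly from the byte order of the frame. The sketch below reproduces them; the helper names are hypothetical, and the use of bit 31 as the type match enable follows the type match register description in this chapter.

```python
def exact_match_regs(mac):
    """Return (bottom, top) register values for a MAC such as '21:43:65:87:A9:CB'."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("a MAC address has six octets")
    # The first octet transmitted lands in the least significant byte.
    bottom = int.from_bytes(octets[:4], "little")   # first four octets
    top = int.from_bytes(octets[4:], "little")      # last two octets
    return bottom, top


def type_match_reg(type_id, enable=True):
    """Return a type match register value; bit 31 enables the comparison."""
    return ((1 << 31) if enable else 0) | (type_id & 0xFFFF)


bottom, top = exact_match_regs("21:43:65:87:A9:CB")
print(hex(bottom), hex(top), hex(type_match_reg(0x4321)))
```

Note that the address octets are packed least significant byte first, while the type ID keeps its transmission order (0x43 then 0x21 gives 0x4321).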
7.10.2 Broadcast Address
The broadcast address of 0xFFFFFFFFFFFF is recognized if the ‘disable broadcast’ bit in the
receive configuration register is zero. To enable type matching, set bit 31 in the type match registers.
7.10.3 Hash Addressing
The hash address register is 64 bits long and takes up two locations in the memory map. The least
significant bits are stored in hash register bottom and the most significant bits in hash register
top. The unicast hash enable and the multicast hash enable bits in the network configuration register enable the reception of hash matched frames. The destination address is reduced to a 6 bit
index into the 64 bit hash register using the following hash function, which is an exclusive OR of
every sixth bit of the destination address.
hash_index[5] = da[5] ^ da[11] ^ da[17] ^ da[23] ^ da[29] ^ da[35] ^ da[41] ^ da[47]
hash_index[4] = da[4] ^ da[10] ^ da[16] ^ da[22] ^ da[28] ^ da[34] ^ da[40] ^ da[46]
hash_index[3] = da[3] ^ da[09] ^ da[15] ^ da[21] ^ da[27] ^ da[33] ^ da[39] ^ da[45]
hash_index[2] = da[2] ^ da[08] ^ da[14] ^ da[20] ^ da[26] ^ da[32] ^ da[38] ^ da[44]
hash_index[1] = da[1] ^ da[07] ^ da[13] ^ da[19] ^ da[25] ^ da[31] ^ da[37] ^ da[43]
hash_index[0] = da[0] ^ da[06] ^ da[12] ^ da[18] ^ da[24] ^ da[30] ^ da[36] ^ da[42]
In the hash function above, da[0] represents the least significant bit of the first byte received; that
is, the multicast/unicast indicator, and da[47] represents the most significant bit of the last byte
received. If the hash index points to a bit that is set in the hash register, the frame will be matched
according to whether the frame is multicast or unicast. A multicast match will be signalled if the
multicast hash enable bit is set: da[0] is 1 and the hash index points to a bit set in the hash register.
A unicast match will be signalled if the unicast hash enable bit is set: da[0] is 0 and the hash index
points to a bit set in the hash register. To receive all multicast frames, the hash register should be
set with all ones and the multicast hash enable bit should be set in the receive configuration
register.
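The hash function and match rules above can be sketched in software, for example to precompute which hash register bit to set for a given address. The function names are hypothetical; the bit numbering follows the text, with da[0] as the least significant bit of the first byte received.

```python
def hash_index(mac):
    """Reduce a 6-byte destination address to a 6-bit hash index."""
    if len(mac) != 6:
        raise ValueError("a MAC address has six octets")
    # Little-endian packing puts da[0] at bit 0 and da[47] at bit 47.
    da = int.from_bytes(mac, "little")
    index = 0
    for _ in range(8):
        index ^= da & 0x3F   # XOR the next group of six address bits
        da >>= 6
    return index


def hash_match(mac, hash_reg, unicast_enable, multicast_enable):
    """Return True when the 64-bit hash register accepts this address."""
    bit_set = (hash_reg >> hash_index(mac)) & 1
    is_multicast = mac[0] & 1            # da[0] is the group/individual bit
    enabled = multicast_enable if is_multicast else unicast_enable
    return bool(bit_set and enabled)


mac = bytes.fromhex("21436587A9CB")
print(hash_index(mac))
```

XORing eight consecutive 6-bit groups is equivalent to the per-bit equations, since hash_index[k] combines da[k], da[k+6], and so on up to da[k+42].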
7.11 Special Modes
The MAC supports a number of custom modes that may be used for non-standard Ethernet or
optimized packet transport applications.
7.11.1 Pass All Frames Mode
In this mode, the RX filters are not applied and all frames are passed to mPIPE. This is typically
used in applications where the XAUI port is not terminating the stream (bump-on-wire, monitor
applications).
7.11.2 Custom Preamble
The standard Ethernet preamble can be overridden on TX to include custom bytes and passed
through to mPIPE on RX. This allows additional payload to be included with each packet without
changing the overall octet count for the frame. The NO_TX_PRE bit of the
MPIPE_XAUI_MAC_INTFC_CTL register is used to allow a custom preamble on transmit, and the
PASS_PREAMBLE bit of the MPIPE_XAUI_RECEIVE_CONFIGURATION register allows the preamble bytes to be passed to mPIPE.
Additionally, if custom protocols require CRC to cover the custom preamble bytes, the CRC may
be configured via the PREAMBLE_CRC bit of the MPIPE_XAUI_RECEIVE_CONFIGURATION register
and PREAMBLE_CRC bit of the MPIPE_XAUI_TRANSMIT_CONFIGURATION register.
7.11.3 Short IPG
Additional byte-stuffing can be achieved by shortening the inter packet gap (IPG) on TX and RX.
On RX, the MAC can handle an IPG that has been shortened from an average of 12 down to an
average of 8. On TX, the inserted-IPG can be shortened using the DECREAS_IPG bit of the
MPIPE_XAUI_TRANSMIT_CONFIGURATION register.
The IPG is used to compensate for link partners that have a small difference in their reference
clocks. When using a shortened IPG, it is up to the system designer to ensure that sufficient IPG
remains to prevent overflow of either component’s elasticity FIFO.
7.12 SERDES Control
The SERDES has programmable control for drive level, emphasis, equalization, PLL settings etc.
These SERDES settings are configured automatically by hardware based on port mode and power
up calibration. Settings may be overridden using the SERDES register interface in MPIPE_XAUI_SERDES_CONFIG. The SERDES configuration registers are not specified in the Tile Processor and I/O
Device Guide for the TILE-Gx Family of Processors (UG404).
7.13 LEDs
Each XAUI MAC has a dedicated pair of LED outputs. These are typically driven automatically
based on link state and activity, but software may override the LED behavior and state via the
LED_MODE and OVD_VAL bit settings of the MPIPE_XAUI_PCS_CTL register.
CHAPTER 8
SGMII MAC INTERFACE
8.1 Introduction
The TILE-Gx™ GbE MAC and PCS provide an SGMII-based 1Gb/s Ethernet interface to mPIPE™.
Configuration of the MAC is through the mPIPE MMIO space.
For more information on mPIPE operations and features, refer to Chapter 6: mPIPE Architecture.
8.1.1 Features
The GbE MAC and PCS interface:
• Is compatible with IEEE Standard 802.3
• Provides an SGMII interface with CDR (inband clock recovery)
• Supports 10/100Mbps and half-duplex operation
• Provides a PCS layer with auto-negotiation
• Supports 802.1Qbb priority-based flow control
• Supports precision timestamping and IEEE 1588
• Supports MDIO and Interrupt interfaces to off-chip PHYs
• Supports multiple loopback modes for in-system test and characterization
• Supports independent polarity reversal on TX and RX
• Provides support for 802.3az Energy Efficient Ethernet
8.2 Register Spaces
Access to the GbE MAC is via the mPIPE’s MAC interface in MMIO space. The format for the
physical address is shown in Figure 8-1 and described in Table 8-1.
Figure 8-1: TRIO Interface Format
Table 8-1. TRIO_CFG_REGION_ADDR Register Bit Descriptions
Bits    Name      Type  Reset  Description
39:35   SVC_DOM   RW    0      This field of the address indexes the 32 entry service domain table.
28:26   REGION    RW    0      This field of the address selects the region (address space) to be accessed. For the config region, this field must be 0.
21      INTFC     RW    0      Interface being accessed. Value 0 (MPIPE): access to MPIPE registers. Value 1 (MAC): access to MAC registers.
20:16   MAC_SEL   RW    0      Selects the MAC being accessed when bit[21] is 1.
15:0    REG       RW    0      Register address.
Registers in the GbE MAC are all 8-bytes. Accesses smaller than 8-bytes are not supported and
will result in an MMIO error returned to the requesting Tile.
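As an illustration of the address format in Table 8-1, the sketch below packs the fields into an MMIO offset. The function name is hypothetical, and the 8-byte alignment check is an assumption drawn from the statement that all registers are 8 bytes wide; it is not stated explicitly in the table.

```python
def mac_mmio_offset(mac_sel, reg, svc_dom=0, region=0, intfc=1):
    """Pack the Table 8-1 fields into an MMIO offset (intfc=1 selects a MAC)."""
    if not 0 <= svc_dom < 32:
        raise ValueError("SVC_DOM is a 5-bit field (bits 39:35)")
    if not 0 <= region < 8:
        raise ValueError("REGION is a 3-bit field (bits 28:26)")
    if intfc not in (0, 1):
        raise ValueError("INTFC is a single bit (bit 21)")
    if not 0 <= mac_sel < 32:
        raise ValueError("MAC_SEL is a 5-bit field (bits 20:16)")
    if not 0 <= reg <= 0xFFFF:
        raise ValueError("REG is a 16-bit field (bits 15:0)")
    if reg % 8:
        # Assumption: 8-byte registers sit on 8-byte boundaries.
        raise ValueError("register offsets are expected to be 8-byte aligned")
    return (svc_dom << 35) | (region << 26) | (intfc << 21) | (mac_sel << 16) | reg


# Offset of the MAC_INFO register (address 0x0000) of MAC number 3.
print(hex(mac_mmio_offset(mac_sel=3, reg=0x0000)))
```

Iterating mac_sel over its range and reading the register at offset 0x0000 of each MAC is one way software could perform the MAC discovery step; the read itself must be a full 8-byte access.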
8.3 MAC and Channel Mapping
MACs are assigned MAC-Numbers in hardware. This number is used in the MAC_SEL field of the
MMIO address when accessing MAC registers. System software can perform MAC “discovery” by
reading the MAC_INFO register, which is always located at address 0x0000 in each MAC’s address
space.
Each GbE MAC is also assigned to a specific hardware channel in mPIPE. This mapping is provided in the CHANNEL bit of the MPIPE_GBE_MAC_INFO register and is also described by the
mPIPE’s MPIPE_MACn_MAP registers (MPIPE_MAC0_MAP register, for example).
The channel numbers are assigned to eDMA rings by system software. See “Ring to Channel Mapping” on page 127.
8.4 Port Configuration
The GbE port is enabled via the MPIPE_MAC_ENABLE register. Basic MAC settings are configured
in the MPIPE_GBE_NETWORK_CONTROL and MPIPE_GBE_NETWORK_CONFIGURATION registers.
Once enabled, the link status can be monitored through the MPIPE_GBE_NETWORK_STATUS
register.
A port that is disabled will automatically turn off the SERDES and operate in a low power mode.
MAC registers, interrupts, and MDIO functions are still accessible when a port is disabled.
8.4.1 Lane Sharing with XAUI
XAUI ports can be reconfigured into four independent SGMII ports. This is controlled by the
MPIPE_MAC_ENABLE register. When one or more lanes are operating in SGMII mode, the XAUI
MAC is no longer used; the SGMII MACs control the lane instead. Refer to Chapter 7: XAUI MAC
Interface for more information.
The AVAIL bits of the MPIPE_GBE_MAC_INFO and MPIPE_MAC_ENABLE registers indicate
whether or not the port can be used in the device’s configuration.
Some SGMII control functions are still managed through the associated XAUI port’s configuration registers including:
• TX/RX polarity (via the MPIPE_XAUI_PCS_CTL register)
• SERDES controls (via the MPIPE_XAUI_SERDES_CONFIG register)
8.5 Flow Control
The GbE MAC supports standard 802.3 pause-based flow control as well as 802.1Qbb priority-based flow control. The MAC can auto-generate either pause type based on configurable high
water marks in the mPIPE iPKT buffer. The reception of pause frames triggers a pause condition on one or more TX queues. These queues can be mapped into the mPIPE eDMA arbiter to
block traffic from one or more rings that target the MAC.
mPIPE supports up to 16 priority queues in the iPKT buffer. Each has a programmable high
water mark. The MPIPE_PR_PAUSE_THR registers contain the programmable high water
marks. The lower 8 priority queues can be mapped directly to the eight 802.1Qbb PFC queues.
The upper 8 priority queues are used for 802.3 pause and/or mPIPE loopback channels.
The PAUSE_MODE bit of the MPIPE_GBE_MAC_INTFC_CTL register controls the type of pause
frame to be sent when priority queues become full.
8.5.1 Priority-Based Flow Control
Incoming RX packets are assigned to a priority queue based on the data extracted from the VLAN
priority tag as per IEEE 802.1Qbb. The RX queue selection can be overridden by setting the
PRQ_OVD or PRQ_OVD_VAL bits in the MPIPE_GBE_MAC_INTFC_CTL register.
As RX packets are written into their associated queues, a counter watches the queue to check its
level. Once the high water mark is reached, the MAC can automatically dispatch priority pause
frames, indicating back pressure on one or more queues. The MAC can be configured to monitor
any number of queues via the TX_PRQ_ENA bit of the MPIPE_GBE_MAC_INTFC_CTL register.
The MPIPE_GBE_MAC_INTFC_TX_CTL register contains settings to override the mPIPE eDMA
back pressure so that one or more queues can be manually paused or unpaused.
8.6 Interrupts
Each SGMII port has an associated GPIO pin that is designated for PHY interrupt use if needed. It
is up to the system software to direct this GPIO’s input to an interrupt targeting the associated
GbE software driver.
The GbE MAC produces a number of interrupt conditions as described in the MPIPE_GBE_INTERRUPT_STATUS register description.
8.7 Timestamping and IEEE 1588
The GbE MAC supports IEEE 1588 frame recognition and generation for system-wide time correlation. Operation of the 1588 frame recognition is similar to the way it is handled in XAUI MAC.
Refer to Chapter 7: XAUI MAC Interface for a description of frame recognition operations.
The GbE MAC has a dedicated timestamper in the MPIPE_GBE_1588_TIMER registers (including
the MPIPE_GBE_1588_TIMER_ADJUST register) and captures timestamps into the MPIPE_GBE_PTP registers (including the MPIPE_GBE_PTP_PEER_EVENT_FRAME_RECV_SECS register).
The GbE timestamp can be correlated to the mPIPE timestamper in software by performing iterative/alternating reads and using the timestamper adjustment controls.
8.8 MDIO
An MDIO interface is provided to allow configuration of an off-chip PHY. Multiple GbE MACs
may share a single MDIO port depending on the device configuration (see datasheet). When more
than one port shares the MDIO, access is coordinated in software and enabled via the
MPIPE_MAC_MANAGE register.
The MAC’s MPIPE_GBE_PHY_MAINTENANCE register is used to operate the MDIO interface.
8.9 10/100Mbps Support
The GbE MAC supports operation in 10/100Mbps modes via auto-negotiation or fixed configuration. The lower speed modes are achieved by replicating symbols on the link as per the SGMII
specification (see “Serial-GMII Specification, Revision 1.7” on page 584).
8.10 Half-Duplex Support
Although the SGMII interface from TILE-Gx (MAC) to the PHY is always full-duplex, half-duplex
PHY-to-PHY links are supported. The MAC will retransmit frames when collisions occur within
the 802.3-specified collision window.
8.11 Energy Efficient Ethernet Support (IEEE 802.3az)
IEEE 802.3az adds support for energy efficiency to Ethernet. These are the key features of 802.3az
enhancements:
• Allows a system’s transmit path to enter a low power mode if there is nothing to transmit
• Allows a PHY to detect whether its link partner’s transmit path is in a low power mode, therefore allowing the system’s receive path to enter low power mode
• Ensures that the link remains up during low power mode and no frames are dropped
• Enables asymmetric operation, with one direction in low power mode while the other is transmitting normally
• Provides LPI (Low Power Idle) signaling used to control entry and exit to and from low power modes
• Ensures that LPI signaling can only take place if both sides have indicated support for it through auto-negotiation
8.11.1 802.3az Operation
158
•
Low power control is done at the MII (reconciliation sub-layer).
•
As an architectural convenience, in writing the 802.3az it is assumed that transmission is
deferred by asserting carrier sense; in practice it will not be done this way. This system will
know when it has nothing to transmit and only enter low power mode when it is not transmitting.
•
Power Idle (PI) should not be requested unless the link has been up for at least one second.
•
LPI is signaled on the GMII transmit path by asserting 0x01 on txd with tx_en low and tx_en
high.
•
A PHY on seeing LPI requested on the MII will send the sleep signal before going quiet. After
going quiet it will periodically emit refresh signals.
•
The sleep, quiet and refresh periods are defined in Table 78-2 of the 802.3az specification. For
1000BASE-X the sleep period is 20us, the quiet period is 2.5ms and the refresh period is 20us.
Tile Processor and I/O Device Guide for the TILE-Gx Family of Processors
Tilera Confidential — Subject to Change Without Notice
PCS Auto-Negotiation
•
1000BASE-X is required to go quiet after sleep is signaled. The easiest way to handle this is to
write to disable transmit in the SerDes.
•
SGMII and XFI are not part of 802.3az and should not go quiet after sleep is signaled.
•
LPI mode ends by transmitting normal idle for the wake time. A default time has been established for this condition, but it can be adjusted in software using the Link Layer Discovery Protocol (LLDP) described in Clause 79 of 802.3az.
•
LPI is indicated at the receive side when sleep and refresh signaling has been detected.
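As a rough illustration of the timing figures quoted above, the following sketch computes how much of each quiet/refresh cycle a 1000BASE-X transmitter spends quiet. The macro and function names are hypothetical; only the 20 us, 2.5 ms, and 20 us values come from the text.

```c
#include <stdint.h>

/* Timing values quoted above for 1000BASE-X, in microseconds. */
#define LPI_SLEEP_US   20u
#define LPI_QUIET_US   2500u
#define LPI_REFRESH_US 20u

/* Share of one quiet/refresh cycle spent quiet, in parts per thousand
 * (integer division, so the result is rounded down). */
static inline uint32_t lpi_quiet_permille(uint32_t quiet_us, uint32_t refresh_us)
{
    return (quiet_us * 1000u) / (quiet_us + refresh_us);
}
```

With the values above, the transmitter is quiet for roughly 99 percent of each cycle, which is where the power saving comes from.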
8.11.2 LPI Operation in the MAC
System software must control LPI since this is a system level function. LPI operation is straightforward and firmware should be capable of responding within the required timeframes.
Auto-negotiation indicates EEE capability using next page auto-negotiation.
For the transmit path:
•
If the link has been up for 1 second and there is nothing being transmitted, write to the LPI bit
in the network control (MPIPE_GBE_NETWORK_CONTROL) register
•
Wake up by clearing the ENA_LPI bit in the XGM transmit control (MPIPE_GBE_MAC_INTFC_TX_CTL) register
For the receive path:
•
Wait for an interrupt to indicate that LPI has been received
•
Take any software action desired to reduce power (decrease mPIPE frequency for example or
change from polling to interrupt modes on receive queues)
•
Wait for an interrupt to indicate that regular power operation has resumed, and then re-enable any receive features that have been suspended.
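The transmit-side entry condition above can be sketched as a small predicate. This is only an illustration of the decision; the function name and the way link-up time and pending frames are tracked are assumptions, and the actual register writes are left to the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* The link must have been up for at least one second before LPI is requested. */
#define LPI_MIN_LINK_UP_MS 1000u

/* Returns true when software may request LPI on the transmit path:
 * the link has been up long enough and nothing is waiting to be sent. */
static inline bool lpi_tx_may_enter(uint32_t link_up_ms, uint32_t tx_pending_frames)
{
    return link_up_ms >= LPI_MIN_LINK_UP_MS && tx_pending_frames == 0;
}
```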
8.12 PCS Auto-Negotiation
An auto-negotiation block provides a means for the PCS to establish automatic link configuration.
It is performed at power-up or during normal operation if requested by a link partner or through
the restart auto-negotiation bit in the PCS control register.
By default the Gigabit Ethernet MAC has auto-negotiation enabled in the PCS control (MPIPE_GBE_PCS_CTL) register and full and half duplex capability enabled in the PCS auto-negotiation
advertisement register. The Pause capability is disabled in the advertisement (MPIPE_GBE_PCS_AUTO_NEG) register by default. If auto-negotiation is not required, then bit 12 (the
AUTO_NEG bit) of the PCS control register needs to be set LOW.
When a new base or next page is received from the link partner, a PCS link partner page received
interrupt is set [bit 17 (the PCS_PART_PAGE bit) of the interrupt status (MPIPE_GBE_INTERRUPT_STATUS) register]. The first time this interrupt is received, it indicates that a base page was received;
each subsequent occurrence indicates a next page.
In order for the next page exchange to work, the next page register (0x21c) must be written within
10 ms of receiving a new page from the link partner. If the link partner is requesting next pages
and GbE MAC has none to send, then the next page register should be written with the null message (0x2001). The value 0x0000 must not be written to the next page
(MPIPE_GBE_PCS_AUTO_NEG_NXT_PG) register.
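A minimal sketch of the next page rule described above: choose the value to write, falling back to the null message, and check the 10 ms deadline. The function and macro names are hypothetical; the 0x21c offset, the 0x2001 null message, the prohibition on 0x0000, and the 10 ms limit come from the text.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCS_NEXT_PAGE_OFFSET      0x21cu  /* next page register offset */
#define PCS_NEXT_PAGE_NULL_MSG    0x2001u /* null message */
#define PCS_NEXT_PAGE_DEADLINE_MS 10u     /* write within 10 ms of a new page */

/* Value to write to the next page register. If there is no page to send,
 * or the candidate is the forbidden value 0x0000, send the null message. */
static inline uint16_t pcs_next_page_value(bool have_page, uint16_t page)
{
    if (!have_page || page == 0x0000u)
        return PCS_NEXT_PAGE_NULL_MSG;
    return page;
}

/* True if the write still meets the deadline. */
static inline bool pcs_next_page_in_time(uint32_t elapsed_ms)
{
    return elapsed_ms <= PCS_NEXT_PAGE_DEADLINE_MS;
}
```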
The GbE MAC signals completion of auto-negotiation through the PCS auto-negotiation complete
interrupt, on bit 16 of the interrupt status (MPIPE_GBE_INTERRUPT_STATUS) register. Auto-negotiation completion is also indicated by bit 5 of the PCS status (MPIPE_GBE_PCS_STS) register.
The PCS resolves the GbE MAC and link partner’s abilities and reports the result in the network
status register. Pause transmit and receive resolution are reported in accordance with Table 37-4
in the IEEE 802.3 specification. If full duplex capability is resolved, the duplex resolution bit is set
HIGH in the network status (MPIPE_GBE_NETWORK_STATUS) register. If half duplex capability is
resolved, the duplex resolution bit will be LOW. If the GbE MAC and its link partner cannot
resolve a common duplex capability, the duplex resolution (FULL_DUPLEX) bit is not set and link
will be indicated as being down (bit 2 in the PCS status (MPIPE_GBE_PCS_STS) register and bit 0
in the network status (MPIPE_GBE_NETWORK_STATUS) register will both be zero) when auto-negotiation completes. Although the GbE MAC reports the auto-negotiation resolution, it does not
automatically reconfigure its duplex and pause states. Management software must therefore
set the duplex bit in the network configuration (MPIPE_GBE_NETWORK_CONFIGURATION) register if it reads the duplex resolution bit as set in the network status register.
8.12.1 PCS Collision Detect and Carrier Sense
The PCS provides both the carrier sense and collision signals for use by the MAC sub-layer when
the ten bit interface (TBI) is active.
CRS (Carrier Sense) is generated by the following conditions:
•
The receiver has decoded a start of packet/end of packet or receive carrier extension is active.
This state is indicated internally to the PCS by CRS receive.
•
tx_en is active, or carrier extension is active for transmit.
The collision signal is generated whenever the PCS is requested to transmit an Ethernet frame
when the crs receive signal indicates it is active. The col signal remains active for the duration
of the collision. Both crs and col are asserted, regardless of the PCS’ mode (operating in half
duplex or full duplex mode).
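The carrier sense and collision conditions above reduce to simple boolean logic. The sketch below models them in software for illustration only; the struct and parameter names are hypothetical, and rx_carrier stands for the internal "CRS receive" state.

```c
#include <stdbool.h>

struct pcs_signals {
    bool crs; /* carrier sense */
    bool col; /* collision */
};

/* rx_carrier:     receiver decoded start/end of packet, or receive carrier
 *                 extension is active (the internal "CRS receive" state)
 * tx_en:          the PCS is being asked to transmit a frame
 * tx_carrier_ext: carrier extension is active for transmit */
static inline struct pcs_signals pcs_crs_col(bool rx_carrier, bool tx_en,
                                             bool tx_carrier_ext)
{
    struct pcs_signals s;
    s.crs = rx_carrier || tx_en || tx_carrier_ext;
    s.col = tx_en && rx_carrier; /* transmit requested while receive is active */
    return s;
}
```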
8.12.2 Link Status
The PCS link status is indicated on bit 2 of the PCS status (MPIPE_GBE_PCS_STS) register, on bit 0
of the network status register, and on bit 9 of the interrupt status register. An interrupt is generated each time the PCS link status changes (that is, whenever the link comes up or goes down).
When auto-negotiation is disabled, the link status value is determined based on whether or not
the PCS is in the synchronized state. When auto-negotiation is enabled, the link status value is determined by successful completion of auto-negotiation.
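The bit positions mentioned in this section can be collected into masks, as in the sketch below. The macro and function names are hypothetical; only the bit numbers are taken from the text, and the register values are assumed to have been read already by whatever MMIO accessor the driver uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Interrupt status (MPIPE_GBE_INTERRUPT_STATUS) bits named in the text. */
#define GBE_INT_LINK_CHANGE (1u << 9)  /* PCS link status changed */
#define GBE_INT_AN_COMPLETE (1u << 16) /* PCS auto-negotiation complete */
#define GBE_INT_PART_PAGE   (1u << 17) /* link partner page received */

/* PCS status (MPIPE_GBE_PCS_STS) bits named in the text. */
#define GBE_PCS_STS_LINK    (1u << 2)
#define GBE_PCS_STS_AN_DONE (1u << 5)

/* Network status (MPIPE_GBE_NETWORK_STATUS) bit named in the text. */
#define GBE_NET_STS_LINK    (1u << 0)

/* Link is reported up only if both status registers agree. */
static inline bool gbe_link_up(uint32_t pcs_sts, uint32_t net_sts)
{
    return (pcs_sts & GBE_PCS_STS_LINK) && (net_sts & GBE_NET_STS_LINK);
}

/* Auto-negotiation completion, from either the interrupt or the status bit. */
static inline bool gbe_an_complete(uint32_t int_status, uint32_t pcs_sts)
{
    return (int_status & GBE_INT_AN_COMPLETE) || (pcs_sts & GBE_PCS_STS_AN_DONE);
}
```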
8.13 Statistics
The GbE MAC provides statistics counters for transmitted and received frames including byte
counts, frame counts, specific frame sizes, and specific frame types. These registers are in the
MAC starting with the MPIPE_GBE_OCTETS_TX_LO register.
8.14 Filtering
Incoming packets can be checked for an exact match against up to 4 MAC addresses in the
MPIPE_GBE_SPECIFIC_ADDRESS registers (see MPIPE_GBE_SPECIFIC_ADDRESS_1_BOTTOM_31_0, as
an example). They can also be checked against a set of type-match registers.
The address checking, or filter, block determines which receive frames should be sent to mPIPE.
The decision to send or drop a frame depends on what is enabled in the receive configuration
(MPIPE_XAUI_RECEIVE_CONFIGURATION) register, the contents of the specific address and type
match registers, and the frame’s destination address and type/length field.
The exact address match address is split between two registers, high and low. To enable or disable exact address matching, write to the exact address registers (the
MPIPE_XAUI_EXACT_MATCH_BOTTOM_0 register, for example): the low register to disable, the
high register to enable. The first six bytes (48 bits) of an Ethernet frame make up the destination
address. The first bit of the destination address (that is, the LSB of the first byte of the frame) is
the group/individual bit. This is one for multicast addresses and zero for unicast. The all ones
address is the broadcast address and a special case of multicast.
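The group/individual bit and the broadcast special case can be checked in a few lines. This is an illustrative sketch with hypothetical function names; da[0] is the first byte of the frame as received.

```c
#include <stdbool.h>
#include <stdint.h>

/* The LSB of the first destination-address byte is the group/individual bit:
 * 1 for multicast (group), 0 for unicast (individual). */
static inline bool mac_is_multicast(const uint8_t da[6])
{
    return (da[0] & 0x01u) != 0;
}

/* The all-ones address is broadcast, a special case of multicast. */
static inline bool mac_is_broadcast(const uint8_t da[6])
{
    for (int i = 0; i < 6; i++) {
        if (da[i] != 0xFFu)
            return false;
    }
    return true;
}
```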
The MAC supports recognition of four specific addresses. Each specific address requires two registers, specific address register bottom and specific address register top. Specific address register
bottom stores the first four bytes of the destination address and specific address register top contains the last two bytes. The addresses stored can be specific, group, local or universal. See IEEE
Standard 802-2001, Clause 9 for a detailed description of 802 addressing.
8.14.1 Type ID Checking
The contents of the four type match registers are compared to the length/type ID in bytes 13 and
14 of received frames. If there is a match, the frames are copied into the RX FIFO. The following
example illustrates the use of the address and type ID match registers for a MAC address of
21:43:65:87:A9:CB.
Preamble             55
SFD                  D5
DA (Octet 0 - LSB)   21
DA (Octet 1)         43
DA (Octet 2)         65
DA (Octet 3)         87
DA (Octet 4)         A9
DA (Octet 5 - MSB)   CB
SA (LSB)             00
SA                   00
SA                   00
SA                   00
SA                   00
SA (MSB)             00
Type ID              43
Type ID              21
The sequence above shows the beginning of an Ethernet frame. Byte order of transmission is from
top to bottom as shown. For a successful match to specific address 1, the following address
matching registers must be set up:
MPIPE_GBE_SPECIFIC_ADDRESS_1_BOTTOM_31_0 = 0x87654321
MPIPE_GBE_SPECIFIC_ADDRESS_1_TOP_47_32 = 0x0000CBA9
And for a successful match to type ID, the following type ID match register must be set up:
MPIPE_GBE_TYPE_ID_MATCH_1 = 0x80004321
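The register values in this example follow directly from the byte order of the frame. The sketch below derives them; the function names are hypothetical, and it assumes bit 31 of the type match value is the enable bit, as the example value 0x80004321 suggests.

```c
#include <stdint.h>

/* Specific address bottom: the first four destination-address bytes,
 * with the first byte received in the least significant position. */
static inline uint32_t specific_addr_bottom(const uint8_t da[6])
{
    return (uint32_t)da[0] | ((uint32_t)da[1] << 8) |
           ((uint32_t)da[2] << 16) | ((uint32_t)da[3] << 24);
}

/* Specific address top: the last two destination-address bytes. */
static inline uint32_t specific_addr_top(const uint8_t da[6])
{
    return (uint32_t)da[4] | ((uint32_t)da[5] << 8);
}

/* Type ID match: enable bit (assumed bit 31) plus the 16-bit type ID,
 * where type_first is byte 13 of the frame and type_second is byte 14. */
static inline uint32_t type_id_match(uint8_t type_first, uint8_t type_second)
{
    return 0x80000000u | ((uint32_t)type_first << 8) | (uint32_t)type_second;
}
```

For the address 21:43:65:87:A9:CB and type ID bytes 43, 21, these functions reproduce the three values listed above.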
8.14.2 Broadcast Address
The broadcast address of 0xFFFFFFFFFFFF is recognized if the ‘disable broadcast’ bit in the
receive configuration (MPIPE_XAUI_RECEIVE_CONFIGURATION) register is zero. To enable type
matching, set bit 31 (the ENA bit) in the type match registers (MPIPE_XAUI_TYPE_MATCH0,
MPIPE_XAUI_TYPE_MATCH1, MPIPE_XAUI_TYPE_MATCH2, and MPIPE_XAUI_TYPE_MATCH3 registers).
8.14.3 Hash Addressing
The hash address register (MPIPE_GBE_HASH_BOTTOM_31_0 and MPIPE_GBE_HASH_TOP_63_32) is 64 bits long and takes up two locations in the memory map. The least significant bits are
stored in hash register bottom and the most significant bits in hash register top. The unicast hash enable
(ENA_HASH_UNI) and the multicast hash enable (ENA_HASH_MULTI) bits in the receive configuration (MPIPE_XAUI_RECEIVE_CONFIGURATION) register enable the reception of hash matched
frames. The destination address is reduced to a 6-bit index into the 64-bit hash register using the
following hash function, which is an exclusive OR of every sixth bit of the destination address.
hash_index[5] = da[5] ^ da[11] ^ da[17] ^ da[23] ^ da[29] ^ da[35] ^ da[41] ^ da[47]
hash_index[4] = da[4] ^ da[10] ^ da[16] ^ da[22] ^ da[28] ^ da[34] ^ da[40] ^ da[46]
hash_index[3] = da[3] ^ da[09] ^ da[15] ^ da[21] ^ da[27] ^ da[33] ^ da[39] ^ da[45]
hash_index[2] = da[2] ^ da[08] ^ da[14] ^ da[20] ^ da[26] ^ da[32] ^ da[38] ^ da[44]
hash_index[1] = da[1] ^ da[07] ^ da[13] ^ da[19] ^ da[25] ^ da[31] ^ da[37] ^ da[43]
hash_index[0] = da[0] ^ da[06] ^ da[12] ^ da[18] ^ da[24] ^ da[30] ^ da[36] ^ da[42]
In the hash function above, da[0] represents the least significant bit of the first byte received
(that is, the multicast/unicast indicator), and da[47] represents the most significant bit of the last
byte received. If the hash index points to the IDX bit that is set in the hash (MPIPE_HFH_INIT_CTL) register, the frame is matched, depending on whether the frame is multicast or
unicast. A multicast match is signalled if the multicast hash enable (ENA_HASH_MULTI) bit is
set, da[0] is 1, and the hash index points to a bit set in the hash register. A unicast match is
signalled if the unicast hash enable (ENA_HASH_UNI) bit is set, da[0] is 0, and the hash index
points to a bit set in the hash register. To receive all multicast frames, the hash (MPIPE_HFH_INIT_CTL) register should be set to all ones and the multicast hash enable
(ENA_HASH_MULTI) bit should be set in the receive configuration (MPIPE_XAUI_RECEIVE_CONFIGURATION) register.
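Because the hash XORs every sixth bit, it is equivalent to XORing the eight consecutive 6-bit groups of the 48-bit destination address. The sketch below shows that computation and the match rule in software, which can be useful for deciding which hash register bits to set. All function names are hypothetical, and the 64-bit hash value is assumed to be the concatenation of the top and bottom hash registers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Pack the destination address so that bit 0 is the LSB of the first byte
 * received and bit 47 is the MSB of the last byte received. */
static inline uint64_t mac_to_da_bits(const uint8_t da[6])
{
    uint64_t v = 0;
    for (int i = 0; i < 6; i++)
        v |= (uint64_t)da[i] << (8 * i);
    return v;
}

/* hash_index[k] = da[k] ^ da[k+6] ^ ... ^ da[k+42], for k = 0..5. */
static inline unsigned gbe_hash_index(uint64_t da_bits)
{
    unsigned idx = 0;
    for (int shift = 0; shift < 48; shift += 6)
        idx ^= (unsigned)((da_bits >> shift) & 0x3Fu);
    return idx;
}

/* Match rule: the indexed hash bit must be set, and the enable bit for the
 * frame's class (multicast if da[0] is 1, otherwise unicast) must be set. */
static inline bool gbe_hash_match(uint64_t hash_reg, uint64_t da_bits,
                                  bool ena_hash_uni, bool ena_hash_multi)
{
    bool is_multicast = (da_bits & 1u) != 0;
    bool bit_set = ((hash_reg >> gbe_hash_index(da_bits)) & 1u) != 0;
    return bit_set && (is_multicast ? ena_hash_multi : ena_hash_uni);
}
```

To accept a particular address, software would set bit gbe_hash_index(...) in the 64-bit hash value and enable the matching class.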
CHAPTER 9
TILE-GX INTERLAKEN INTERFACE
9.1 Overview
The TILE-Gx Interlaken port provides a channelized packet interface between mPIPE and 1 or
more SERDES lanes. The interface is compliant with the Interlaken Protocol Definition, revision 1.2
as well as the Interlaken Interoperability Recommendations, revision 1.4.
Note: The Interlaken interface is supported in the TILE-Gx100 and TILE-Gx64 processors only.
The Interlaken interface provides the following features:
•
1 to 10 TX/RX lanes
•
3.125Gbps or 6.25Gbps per lane (OIF CEI-6G-SR)
•
Packet and burst (interleaved) modes
•
Uni-directional support
•
Asymmetrical support (can have different number of TX and RX lanes)
•
In-band or out-of-band flow control
•
Configurable burst size with optimized burst scheduler
•
Programmable flow control calendar
•
Link level and channel flow control
•
Statistics registers, diagnostics, and test patterns
•
MMIO access to registers through mPIPE’s MAC configuration interface
9.1.1 Channel Mapping
mPIPE and Interlaken both support channelized communication. While there is a one-to-one relationship between Interlaken and mPIPE channels, the numbers are different. Similarly, mPIPE
priority queues are mapped to Interlaken channels by the hardware, but the number spaces are
different.
Depending on how many lanes are in use, the Interlaken interface will occupy 16, 20, or 24 channels and priority queues.
Table 9-1 shows the mapping between Interlaken channels, mPIPE channels, and mPIPE priority
queues for various configurations.
9.2 TX Interface
Packets to be sent utilize mPIPE’s eDMA rings. Each ring is assigned to a single channel. Multiple
rings may target the same channel. The hardware guarantees that a packet on a given channel
won’t be interrupted by another ring targeting that same channel.
Table 9-1. Mapping between Channels and Priority
Number of Lanes   Interlaken Channels   mPIPE Channels   mPIPE Priority Queues
1-4               0..15                 12..27           16..31
5-8               0..19                 8..27            12..31
9-10              0..23                 4..27            8..31
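The rows of Table 9-1 follow a regular pattern: the Interlaken channels always occupy the top of the mPIPE channel range (ending at 27) and the top of the priority queue range (ending at 31). The sketch below expresses the table as arithmetic. The function names are hypothetical and the code is only a restatement of the table, not a description of the hardware implementation.

```c
/* Number of Interlaken channels for a lane count, per Table 9-1.
 * Returns -1 for an unsupported lane count. */
static inline int ilk_channel_count(int lanes)
{
    if (lanes < 1 || lanes > 10)
        return -1;
    if (lanes <= 4)
        return 16;
    if (lanes <= 8)
        return 20;
    return 24;
}

/* mPIPE channel for an Interlaken channel; -1 if out of range. */
static inline int ilk_to_mpipe_channel(int lanes, int ilk_channel)
{
    int n = ilk_channel_count(lanes);
    if (n < 0 || ilk_channel < 0 || ilk_channel >= n)
        return -1;
    return 28 - n + ilk_channel; /* highest mPIPE channel used is 27 */
}

/* mPIPE priority queue for an Interlaken channel; -1 if out of range. */
static inline int ilk_to_mpipe_prio_queue(int lanes, int ilk_channel)
{
    int n = ilk_channel_count(lanes);
    if (n < 0 || ilk_channel < 0 || ilk_channel >= n)
        return -1;
    return 32 - n + ilk_channel; /* highest priority queue used is 31 */
}
```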
In packet mode, a packet won’t be interleaved with other packets. In burst mode, packet data is
interleaved from multiple channels. Packet-mode is configured by the PKT bit of the
MPIPE_ILK_TX_CTL register.
Per-eDMA-ring bandwidth control is provided through mPIPE’s egress arbiters, which are configurable through the MPIPE_EDMA_RG_INIT_DAT_THRESH and MPIPE_EDMA_CTL registers.
9.2.1 Burst Scheduler
The Interlaken TX interface attempts to optimize the bursts being sent as per the Interlaken Protocol Definition, revision 1.2. This scheduler requires the ability to look ahead in the packet in order
to determine burst fragmentation. The burst scheduler provides optimum behavior if the MAX_BLKS bits of the MPIPE_EDMA_RG_INIT_DAT_THRESH register are set to at least 3 and the BURST
bits of the MPIPE_EDMA_RG_INIT_DAT_THRESH register are set to 1.
If multiple mPIPE eDMA descriptors are used to generate packets and the final descriptor is
smaller than BurstShort, it is possible that the scheduler will not provide the most efficient burst
fragmentation.
9.2.2 Packet vs. Burst
The Interlaken TX interface may be configured to operate in either packet-at-a-time or burst
mode. In burst mode, packets may be interleaved at a burst boundary according to the BurstMax and
BurstShort parameters of the Interlaken link. In packet mode, complete packets are sent without
any interleaving of other channels’ data. Packet mode is controlled by the PKT bit of the
MPIPE_ILK_TX_CTL register.
The Interlaken RX interface supports either packet or burst interleaving modes. No special settings are required since this is a property of the transmitter.
9.3 RX Interface
Packets received from the Interlaken interface are forwarded to mPIPE on the associated channel
(see Table 9-1). Packets may be as small as 1 byte or as large as 16256 bytes. The RX interface can
receive either burst-interleaved or full packet data.
The timestamp unit applies timestamps relative to the egress from the MAC itself. The latency
from the pins of the chip to the egress of the MAC is generally consistent for a given lane configuration and data rate.
9.4 Flow Control
The TILE-Gx Interlaken interface provides a configurable flow control interface both at the link
and channel level. Flow control features include:
•
In-band or out-of-band support
•
Programmable calendar used to map channels to flow control bits
•
Link level flow control via calendar or multi-use bits
•
Packet or intra-packet level response
9.4.1 Link Level TX Flow Control
The receiver provides link-level back pressure to prevent FIFO overruns due to any bandwidth
mismatch between the Interlaken RX line side and the mPIPE ingress datapath. Sufficient buffering is provided to prevent packet drops as long as the connected device is compliant with the
Interlaken latency requirements described in section 2.9 of the Interlaken Interoperability specification, Revision 1.4.
Link level flow control can be sent to the link partner on one or more calendar bits as well as a
multi-use bit. The TX link level flow control mappings are configured by MPIPE_ILK_TX_LINK_FC_CFG and MPIPE_ILK_TX_LINK_CAL_FC.
9.4.2 Channel-Based Flow Control
Flow control calendar entries are mapped to Interlaken channels via the MPIPE_ILK_TX_CAL
and MPIPE_ILK_RX_CAL registers. The Interlaken channels are mapped by hardware to the
uppermost priority queues in mPIPE. For example, if Interlaken is configured to use 24 channels,
priority queues 8 to 31 are mapped to channels 0 to 23 respectively. A programmable high
water mark for each of the priority queues determines how much relative space each queue is
allocated in the mPIPE iPKT buffer.
9.4.3 Link Level RX Flow Control
One or more calendar bits received from the link partner can be mapped to RX link level flow control (that is, to back-pressure the TILE-Gx Interlaken transmitter). Any one of the multi-use bits
may also be assigned for link level back pressure. The RX link level flow control mappings are
configured by MPIPE_ILK_RX_LINK_FC_CFG and MPIPE_ILK_RX_LINK_CAL_FC.
When the link level flow control bit is indicating XOFF, transmission is terminated from all channels. The amount of skid data is compliant with the Interlaken Interoperability specification,
revision 1.4.
For ports operating in packet mode as per the PKT bit of the MPIPE_ILK_TX_CTL register, the
flow control can be applied at a packet boundary or within the packet. This is based on the setting
in PKT_FC_FAST bit of the MPIPE_ILK_TX_CTL register.
9.4.4 Out-of-Band Flow Control
The TILE-Gx Interlaken interface supports both in-band and out-of-band flow control as per the
Interlaken specification. When out-of-band flow control is enabled, dedicated chip pins carry the
flow control information. Out-of-band flow control is required for uni-directional links.
9.5 Statistics
The TILE-Gx Interlaken interface provides the Interlaken-Alliance recommended statistics registers. This includes per-channel byte and packet counts as well as many per-lane error statistics.
Each counter generates an interrupt on overflow via the MPIPE_ILK_INTERRUPT mechanism.
These interrupts are reflected in the MPIPE_ILK_CTR_OVFL_nn registers.
9.6 Initialization
Configuration of an Interlaken link requires cooperation between the two attached components.
Parameters such as bit rate, number of lanes, type of flow control, flow control calendar, channel
mapping, BurstMax, and BurstMin must be configured identically on both sides of the link. All
link parameters must be established before enabling the MAC via the MPIPE_MAC_ENABLE register.
9.7 Error Handling
A misconfigured link or poor physical channel can cause various types of link errors. Many types
of errors are detected by the hardware and reflected in the MPIPE_ILK_INTERRUPT_STATUS
register. Received packets with bad CRC are forwarded to mPIPE with the descriptor’s CE (CRC
error) bit asserted.
FIFO overruns occur when the link partner fails to obey a link-level flow control event. These
errors generally indicate both an oversubscription of mPIPE bandwidth and an improperly configured flow control calendar at the attached device’s transmitter.
FIFO overrun errors are reflected in the mPIPE descriptor’s ME (MacError) bit in the cases where
partial packet fragments have been forwarded.
CHAPTER 10
USB INTERFACE
10.1 Overview
The TILE-Gx Universal Serial Bus (USB) system includes two host controllers and one device endpoint controller. The system is USB 2.0-compliant. Note that USB 2.0 is backward-compatible with
1.x devices, but has a higher transfer rate of 480 Mb/s (HS) than the 12 Mb/s (FS) transfers of USB
1.1, and the 1.5 Mb/s (LS) transfers of USB 1.0. Refer to www.usb.org for more information about
the USB 2.0 specification.
Two sets of UTMI+ Low Pin Interface (ULPI) signals connect the system to external USB
PHYs. Host controller [0] and endpoint controller [0] share a set of the processor’s external connections via USB0. Host controller [1] has dedicated ULPI connections to an external PHY via
USB1.
The USB subsystem uses the Mesh networks to communicate with the Tile cores via the memory
networks, which includes one reQuest Dynamic Network (QDN), one Share Dynamic Network
(SDN), and two Response Dynamic Network (RDN) networks. These networks carry MMIO
requests, data transfers, and interrupt requests. From the Tile Processors’ viewpoint, the host controllers and the device endpoint channels are located at the same Mesh coordinates.
In order to support bootloading and debugging over USB, the USB0 endpoint controller can operate in a standalone mode without any software driver running on the Tile side. In this mode, the
USB endpoint controller handles all standard USB requests from an external USB host. Users can
boot, debug, and use the tile-monitor application for data movement between the external
host controller and the Tile Processors.
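[Figure 10-1: Channel Descriptions. The USB1 12-pin port connects to a host controller (CH2); the USB0 12-pin port is shared by a host controller (CH1) and the endpoint controller (CH0).]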
This chapter is organized as follows: 10.2 External I/O Interface describes the external PHY interface. 10.3 Mesh Interface presents the iMesh network that is used to access the memory system.
Details of the host controller and the device endpoint channels are described in 10.4 Host Controller and 10.5 Device Endpoint. 10.6 Standalone Device Operation describes the standalone
endpoint system design.
10.2 External I/O Interface
The USB system connects to the PHY through the chip I/O pins and uses the ULPI interface.
As compared to a typical USB 2.0 Transceiver Macrocell Interface (UTMI) PHY package size (of 48
to 56 pins), the ULPI protocol reduces the link to PHY interface to eight signals, since it is optimized for use as an external PHY. A series of ULPI signals are used to manage the external PHY.
Table 10-1 lists the ULPI interface signals.
Table 10-1. ULPI Interface Signals
Signal Name   Direction     Description
Clock         PHY to Link   Control and data signals are synchronous to Clock.
Data          I/O           Driven low by the Link when in IDLE state. The Link starts a transfer by sending a non-zero pattern. The PHY must assert DIR before using the data bus. A turnaround cycle is required every time that DIR toggles (changes direction from inbound to outbound).
DIR           PHY to Link   Direction of the data bus. By default, DIR is low and the PHY listens for non-zero data from the Link. The PHY asserts DIR to gain control of the data bus.
NXT           PHY to Link   Next data. The PHY drives NXT high to throttle the data bus.
STP           Link to PHY   Stop data. The Link drives STP high to signal the end of its data stream. The Link can also drive STP high to request data bus access from the PHY.
A set of the 12-pin chip I/Os is shared by a host controller and the endpoint device. The connection is selected by a chip configuration pin (CONFIG_USB[0]). When CONFIG_USB[0] is deasserted, the port is used by the USB endpoint device. Software can program the connection by
disabling the configuration pin and setting the STRP_PIN_DISABLE field in the USB_DEVICE_USB_PORT0_SELECT register and enabling the host controller by setting the
HOST_ENABLE field in the same register. Another set of 12-pin chip I/Os is always used by the
second host controller.
10.3 Mesh Interface
The iMesh connects to the USB system and provides:
•
Access from the Tile Processors to the USB system via loads and stores in the MMIO address
space.
•
Access from the host controller channels to the memory system to manage data transfers.
•
Interrupt notification to the Tile Processors.
10.3.1 MMIO Interface
Tile software communicates with the USB system via loads and stores in the MMIO address space
using the QDN network. The MMIO space comprises the general system configuration registers, MAC registers, and the RX/TX FIFO storage. The Response Dynamic Network (RDN)
carries the read data and the write acknowledge responses.
The physical address (up to 256 KB of offset within the region) in the MMIO loads and stores is
formatted as follows:
Channel (bits 39:38) | Reserved (bits 37:20) | Protection (bits 19:18) | Register Offset (bits 17:0)
Table 10-2. MMIO Interface Format
Bits    Name              Description
39:38   Channel           Channel. This field is used for access to one of the following:
                          • Channel 0: Endpoint
                          • Channel 1: Host 0
                          • Channel 2: Host 1
                          Several unique structure configuration registers are located only in the channel 0
                          MMIO space, though the structures can be used only in the host channels. These
                          include the MMIO_ADDRESS_SPACE definition, TLB-related registers, and Hash-for-Home configuration registers.
                          For more information about the MMIO Address Space, refer to Table 11, “TILE-Gx
                          Physical Memory Space Descriptions,” on page 2.
37:20   Reserved          Reserved
19:18   Protection        Specifies the register access privilege level.
17:0    Register Offset   Register Offset is defined as follows:
                          • Bits [15:0]: Register address.
                          • Bit 16: Selects the MAC configuration and status registers when set.
                            Note that all MAC register accesses are 4-byte operations, and
                            non-MAC register accesses are 8-byte operations.
                          • Bit 17: Selects between the Open Host Controller Interface (OHCI) and Enhanced Host
                            Controller Interface (EHCI) MAC registers in the host channels.
                            OHCI MAC registers are addressed with bits [17:16] = 2’b11, while
                            EHCI MAC registers are addressed with bits [17:16] = 2’b01.
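The field layout in Table 10-2 can be expressed as a small address-composition helper. This is a sketch under stated assumptions: the enum and function names are hypothetical, the protection value encoding is not given in this section and is passed through unchanged, and the result is the offset within the USB MMIO region rather than a complete physical address.

```c
#include <stdint.h>

/* Channel field, bits 39:38. */
enum usb_channel {
    USB_CH_ENDPOINT = 0,
    USB_CH_HOST0    = 1,
    USB_CH_HOST1    = 2
};

/* Register space selector, bits 17:16. */
enum usb_reg_space {
    USB_REGS_SYSTEM = 0x0, /* non-MAC registers, 8-byte accesses */
    USB_REGS_MAC    = 0x1, /* MAC registers (EHCI in host channels), 4-byte accesses */
    USB_REGS_OHCI   = 0x3  /* OHCI MAC registers in host channels, 4-byte accesses */
};

/* Compose the MMIO offset from its fields, masking each to its width. */
static inline uint64_t usb_mmio_offset(enum usb_channel channel, unsigned protection,
                                       enum usb_reg_space space, uint16_t reg)
{
    return ((uint64_t)((unsigned)channel & 0x3u) << 38) |
           ((uint64_t)(protection & 0x3u) << 18) |
           ((uint64_t)((unsigned)space & 0x3u) << 16) |
           (uint64_t)reg;
}
```

For example, an EHCI register at address 0x10 in host channel 1 yields bit 38 set, bit 16 set, and 0x10 in the low bits.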
10.3.2 Memory Access
The host controller systems can access the memory system via the iMesh network. The SDN network provides the reads and writes to/from the cache system, and the RDN carries the read data
or write acknowledge responses.
The host EHCI controller can generate 64-bit or 32-bit I/O addresses, and the OHCI controller can
generate 32-bit addresses. The I/O addresses are the data pointers in the virtual address space,
and are translated to the memory address using a TLB structure. There are 16 TLB entries per host
channel. In the address translation, an Address Extension Register is used if the controller is in
32-bit address mode. The top bits of the virtual address are ignored in the TLB lookup. Note that
the TLB supports all standard TILE-Gx I/O TLB attributes.
For read requests, 64 bytes of data are always returned to the system from the memory to signal a
normal completion. Data is buffered to serve subsequent data reads, and it is invalidated if
there are writes or any MMIO requests to the channel.
For write requests, a maximum of 4 bytes of data is sent to the memory system at a time, without a need
for any write coalescing. A write acknowledgement is required before any transaction completion
notifications can be generated. Hardware always performs a memory-fence operation (that is, it waits until all
previous write acknowledgements are received) before processing writes to a different cacheline (64-byte) data block.
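The fence condition above depends only on whether two consecutive writes fall in the same 64-byte cacheline. The following sketch models that check in software for illustration; the names are hypothetical and this is not a description of the hardware logic.

```c
#include <stdbool.h>
#include <stdint.h>

#define USB_CACHELINE_BYTES 64u

/* True when a write to next_addr targets a different 64-byte cacheline than
 * the previous write, which is the case in which the text says all earlier
 * write acknowledgements must be received first. */
static inline bool usb_write_crosses_cacheline(uint64_t prev_addr, uint64_t next_addr)
{
    return (prev_addr / USB_CACHELINE_BYTES) != (next_addr / USB_CACHELINE_BYTES);
}
```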
10.3.3 Interrupt Interface
The USB system generates interrupts that are delivered to the Tile Processors if the interrupt
bindings are enabled. Dedicated interrupt bindings can be used for each of the events listed in the
USB_DEVICE/HOST_INT_VEC0_W1TC and USB_DEVICE/HOST_INT_VEC1_W1TC registers. These events
include:
•
MAC interrupt per channel
•
TLB fault handling for host channels
•
Internal interface bus error per channel
•
MMIO configuration error per channel
When a MAC interrupt occurs, software must clear the interrupt registers inside the MAC before
clearing the system interrupt registers. This prevents generation of spurious interrupts to the Tile
Processors.
10.4 Host Controller
One EHCI and one OHCI interface are implemented in a host controller channel. These interfaces
comply with the Enhanced Host Controller Interface (EHCI) Specification, Version 1.0 (http://www.intel.com/technology/usb/ehcispec.htm), and the Open Host Controller Interface
(OHCI) Specification, Version 1.0a (ftp://ftp.compaq.com/pub/supportinformation/papers/hcir1_0a.pdf). Both interfaces support 32-bit addressing and 8- or 32-bit data transfers; the EHCI also supports 64-bit addressing.
The EHCI controller provides descriptor and data prefetching for the next USB packets while the
current USB packet is still active on the USB bus. After the current USB packet transmission ends,
the next packet can immediately go on the bus, because the descriptor and the data are already
fetched from the system memory, thus increasing USB throughput. Up to four descriptors and up
to 4 KB of data (up to 8 Bulk OUT transactions of 512 bytes each) can be prefetched. Unused
descriptors are discarded at the end of the (micro)frame. The OHCI Controller supports the Keyboard/Mouse Legacy Emulation Interface.
10.5 Device Endpoint
10.5.1 Configuration
The device endpoint supports one configuration and up to four interfaces, with one alternative
interface provided for each interface. In addition to the default Endpoint 0, seven extra sets of
endpoint registers are provided. These endpoints can be paired without any restriction for both
IN and/or OUT directions for the same logical endpoint number.
10.5.2 MAC Design
The MAC is implemented with the Slave-Only mode design. Little user intervention is required
because the software is uncomplicated, although users can use a dedicated master for data processing. The application initiates all data transfers to the memory-mapped RX/TX data storage in
the channel. The device acts as a slave to all the data and CSR transfers in the Tile Processors. The
device then responds to the application through a dedicated sideband interrupt.
The TxFIFO is written using MMIO transfers. After one maximum-size packet of data is written and
the IN endpoint status is updated, a TxFIFO controller initiates the data movement for the designated endpoint.
For the OUT (outbound) transactions, the data is written to the Receive FIFO, if space is available.
All endpoints share a common set of receive FIFOs. An address FIFO is used to track the endpoint
number and a flag to differentiate regular data from the eight bytes of SETUP data. The application reads the Endpoint Status register to determine the number of bytes to be transferred, and
then initiates the data transfers.
10.5.3 MAC Interrupts
The MAC provides a single interrupt signal to indicate that at least one interrupt condition exists,
as described in the sections that follow.
10.5.3.1 Device Interrupts
The Device Interrupt Register tracks system-level events. The application clears an interrupt by
writing a 1’b1 to the corresponding bit. A Device Interrupt Mask Register can mask the designated interrupt. These events are:
SC – The device has received a Set_Configuration command.
SI – The device has received a Set_Interface command.
ES – An idle state is detected on the USB for a duration of 3 milliseconds.
UR – A reset is detected on the USB.
US – A suspend state is detected on the USB for a duration of 3 milliseconds, following the 3-millisecond ES interrupt activity due to an idle state.
SOF – An SOF token is detected on the USB.
ENUM – Speed enumeration is complete.
RMTWKP STATE INT – A Set/Clear Feature (Remote Wakeup) is received by the core.
10.5.3.2 Endpoint Interrupts
The Endpoint Interrupt Register tracks the endpoint-level interrupts. Since all eight endpoints can
be bidirectional, each endpoint has two interrupt bits (one for each direction). An Endpoint Interrupt Mask Register can mask the designated interrupt.
The following events are categorized as endpoint-related events:
• Reception of a request for IN data
• Reception of an OUT data packet
• Reception of eight bytes of SETUP data packet
• An application error resulting in an internal Advanced High-Performance Bus (AHB) Error Response.
10.6 Standalone Device Operation
The device endpoint system can be operated without software intervention. The special mode
operation is activated by the chip strap pin USB_CONFIG[0], and can be disabled by the software
by changing the USB0 port ownership to the host controller or setting the DISABLE field in the
USB_DEVICE_CFG_STANDALONE_DEVICE_CONFIG register.
The device endpoint system provides the chip boot and debugging capabilities, including handling standard USB host requests, interrupts, and data transfers between the host and Rshim. By
default, no device or endpoint interrupts are forwarded to the Tile Processors during the process.
10.6.1 Interface and Endpoint Configuration
In the standalone device operation, the device is designed to have one configuration and two
(boot/debug and tile-monitor) interfaces. Four endpoints in addition to the default Endpoint 0 are
used. Table 10-3 summarizes the configuration.
Table 10-3. Standalone Device Configuration
Interface Number   Interface Name   Hard-Coded Endpoint Number   Endpoint Type
1                  Boot/Debug       1                            Bulk Out (a)
2                  Tile-Monitor     2                            Interrupt In
                                    3                            Bulk Out
                                    4                            Bulk In
a. Refer to the USB specification for the Bulk and Interrupt Endpoint definition.
After the connection is established, the external USB host can access all device, configuration,
string, interface, endpoint, device_qualifier and other_speed_configuration descriptor information.
10.6.2 Boot/Debug Interface
There are two endpoints in the boot/debug interface. Endpoint 0 is a control endpoint that can
access the Rshim registers with the Tilera-specific command, described in Table 10-4.
Table 10-4. Format of Setup Data (see USB Specification Revision 2.0 Table 9-2)
Offset   Field           Size (Bytes)   Value             Description
0        bmRequestType   1              Bitmap            D7: Direction; D6…5: Type (2’b10: Vendor); D4…0: Recipient
1        bRequest        1              Value             Rshim Command (8’b0)
2        wValue          2              Value             Rshim Register Channel (4-bit)
4        wIndex          2              Index or Offset   Rshim Register Index (16-bit)
6        wLength         2              Count             Data bytes to transfer (must be 8 bytes)
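The setup data in Table 10-4 can be sketched in C as follows. This is a minimal illustration, not driver code: the names rshim_setup and usb_setup_pack are hypothetical, the Recipient value (device) and the placement of the 4-bit channel in the low bits of wValue are assumptions, and the direction encoding (D7 = 1 for device-to-host) follows the USB 2.0 specification.

```c
#include <assert.h>
#include <stdint.h>

/* 8-byte USB SETUP data (USB 2.0 Table 9-2), held in host-native fields. */
typedef struct {
    uint8_t  bmRequestType;
    uint8_t  bRequest;
    uint16_t wValue;
    uint16_t wIndex;
    uint16_t wLength;
} usb_setup_t;

#define USB_DIR_IN       0x80u /* D7 = 1: device-to-host (register read) */
#define USB_TYPE_VENDOR  0x40u /* D6..5 = 2'b10: Vendor */
#define USB_RECIP_DEVICE 0x00u /* D4..0: assumed; not stated in Table 10-4 */
#define RSHIM_COMMAND    0x00u /* bRequest = 8'b0 */

/* Build the SETUP data for one 8-byte Rshim register access. */
static usb_setup_t rshim_setup(int is_read, unsigned channel, uint16_t reg_index)
{
    usb_setup_t s;
    assert(channel < 16); /* the channel is a 4-bit value */
    s.bmRequestType = (uint8_t)((is_read ? USB_DIR_IN : 0u) |
                                USB_TYPE_VENDOR | USB_RECIP_DEVICE);
    s.bRequest = RSHIM_COMMAND;
    s.wValue   = (uint16_t)channel; /* assumed: channel in the low 4 bits */
    s.wIndex   = reg_index;
    s.wLength  = 8;                 /* must be 8 bytes */
    return s;
}

/* Serialize in USB wire order (multi-byte fields are little-endian). */
static void usb_setup_pack(const usb_setup_t *s, uint8_t out[8])
{
    out[0] = s->bmRequestType;
    out[1] = s->bRequest;
    out[2] = (uint8_t)(s->wValue & 0xFFu);
    out[3] = (uint8_t)(s->wValue >> 8);
    out[4] = (uint8_t)(s->wIndex & 0xFFu);
    out[5] = (uint8_t)(s->wIndex >> 8);
    out[6] = (uint8_t)(s->wLength & 0xFFu);
    out[7] = (uint8_t)(s->wLength >> 8);
}
```

A host would pass the packed bytes to whatever control-transfer call its USB stack provides; that call is outside the scope of this sketch.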
For the boot operation, the host controller first uses Endpoint 0 to read the specific Rshim register
(RSH_PG_CTL) to calculate the maximum amount of data that can be transferred. The actual data is
moved using Endpoint 1, a Bulk OUT endpoint, from the host controller to the chip. Each boot data
transfer must be a multiple of 8 bytes, and the target Rshim register is always
RSH_PG_DATA.
The debug operation also uses Endpoint 0 to read and write Rshim registers. The USB logic
guarantees that any read or write requests through Endpoint 0 are delivered to the Rshim with
higher priority than the requests generated by Endpoint 1.
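The sizing rule for boot transfers can be sketched as a small helper. This is an illustration under stated assumptions: rshim_next_xfer is a hypothetical name, and the available space is taken as an input because the encoding of RSH_PG_CTL is not described in this section. The helper assumes the boot image length has already been padded to a multiple of 8 bytes.

```c
#include <stddef.h>

/*
 * Return the size of the next Bulk OUT boot transfer: the smaller of the
 * remaining data and the space reported by the device, rounded down so
 * that the transfer is a multiple of 8 bytes. A result of 0 means the
 * host should re-read the status before sending more data.
 */
static size_t rshim_next_xfer(size_t remaining, size_t space_available)
{
    size_t n = remaining < space_available ? remaining : space_available;
    return n & ~(size_t)7u;
}
```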
10.6.3 Tile-Monitor Interface
There are two data buffers implemented in Rshim for the Tile-Monitor data movement. The host
controller reads from the Rshim RX data buffer, and writes to the TX data buffer to communicate with
the Tiles. The flow control between the host controller and Rshim is managed by the following
registers:
• RSH_TM_HOST_TO_TILE_STS should be read by the host controller through the Endpoint 1 to calculate how much data it can send to Rshim.
• RSH_TM_HOST_TO_TILE_DATA is the target register for the USB device to deliver the data.
• RSH_TM_TILE_TO_HOST_STS should be read by the host controller through the Endpoint 3 to calculate how much data it can receive from Rshim.
• RSH_TM_TILE_TO_HOST_DATA is the source register for the USB device to request data.
For data moving from the Tiles to the host controller, Endpoint 2 is the Interrupt IN endpoint used to
poll RSH_TM_TILE_TO_HOST_STS periodically. If the read succeeds and there is data to
be transferred, the endpoint returns the number of bytes to the host; otherwise the endpoint
returns NACK. The USB device fetches data (512 bytes or fewer) from Rshim the next time
the host controller makes an Endpoint 4 Bulk IN request. Rshim uses NACK to indicate the
data FIFO empty condition.
For data moving from the host controller to the Tiles, the host controller uses the control Endpoint
0 to access RSH_TM_HOST_TO_TILE_STS to calculate the amount of data (a multiple of 8 bytes in
each data transfer) to be sent. The host controller then uses Endpoint 3 Bulk OUT requests to
deliver data.
The USB logic guarantees any read or write requests through Endpoint 0 or 2 are delivered to
Rshim with higher priority than the requests generated by Endpoint 3 and 4.
CHAPTER 11
COMMON ACCELERATOR INTERFACE (MiCA)
11.1 Introduction
This chapter describes the architecture of the TILE-Gx™ Multicore iMesh Coprocessing Accelerator (MiCA™).
The MiCA provides a common front-end (both SW and HW) to various IO off-load or acceleration
functions, for example Crypto or Compression. The MiCA performs operations as specified by an
opcode on data in memory. The exact set of operations that it performs is dependent on the specific MiCA implementation. For instance, the TILE-Gx Crypto Engine is built around the MiCA
architecture with the set of operations it can perform specified as cryptographic operations.
The MiCA uses a memory-mapped I/O (MMIO) interface for the array of tiles. Because it uses the
MMIO interface, access to the control registers can be controlled through the use of in-tile TLBs.
The memory mapped interface enables tiles to instruct the MiCA to perform operations from user
processes in a protected manner. Memory accesses performed by the MiCA are validated and
translated by an I/O-TLB, which is located in the MiCA. This allows completely protected access
for operations that user code instructs the MiCA to execute.
The MiCA connects to TILE-Gx’s memory networks and processes requests, which come in via its
memory mapped I/O interface. A request consists of a Source Data Descriptor, a Destination Data
Descriptor, a Source Data Length, an Operation to perform, and an optional Pointer to Extra Data
(ED). Many requests can be in flight at one time as the MiCA supports a large number of independent Contexts, each containing its own state. An operation is initiated by writing the request
parameters to a Context’s User registers (see footnote 1).
As the operation progresses, the MiCA verifies that the memory that is accessed by the operation
can be accessed legally. If the operation instructs the MiCA to access data, which is not mapped
by the Context’s I/O-TLB, a TLB Miss interrupt is sent to the Context’s bound tile. It is the responsibility of the tile I/O-TLB miss handler to fill the I/O-TLB. At the completion of the operation the
MiCA sends a completion interrupt to the Context’s bound tile.
Because the MiCA is multi-contexted, multiple operations can be serviced at the same time. Each
MiCA implementation has some number of processing engines (for example, crypto, compression,
etc.) and a Scheduler, which assigns requesting Contexts to those engines. All Contexts are independent of each other. Under typical operation, a Context is allocated to a particular tile and
that tile instructs operation of the Context.
A Context is not multi-threaded. If a tile needs overlapped access to a MiCA accessible accelerator, multiple Contexts can be utilized by a single tile.
1. This is the Context that is used at the same level as Engine and Scheduler. For a definition of Context refer to “Glossary, Conventions and Standards” on page 585.
This chapter is organized with the MiCA common features defined in the main text, and implementation specific details in other chapters (for Crypto and Compression implementations). For
additional information, see Chapter 12: Cryptographic Accelerator Interface or Chapter 13: Compression Accelerator Interface.
11.2 Overview and Major Functional Blocks
Figure 11-1 shows a high level block diagram for the MiCA. Descriptions of each of the sub-blocks
are provided in the following section.
[Block diagram. Blocks shown: Mesh Interface (QDN: MMIO requests from Tiles; RDN: MMIO read data, write acks, and IPI interrupts to Tiles; SDN: memory read and write requests; RDN: memory read data and write acks), MMIO Registers and Context State (Context Registers with Context-specific state, Global Registers), TLB (VAs to PAs), Engine Scheduler (Context assignments, Engine assignments, Engine status), Engine Front End, Egress DMA (from memory to MiCA), Ingress DMA (from MiCA to memory), PA to Route Header Generation, and Function-Specific Engines (for example Crypto or Compression).]
Figure 11-1: MiCA Block Diagram
11.2.1 Major Blocks
The next sections describe the MiCA using Figure 11-1 as a reference point.
11.2.1.1 Mesh Interface
The MiCA interfaces to the Tiles via the mesh interface. The connections are:
• QDN In – Receives Tile MMIO accesses to MiCA.
• RDN Out – MMIO read data and MMIO write acks from MiCA to tiles. Also MiCA sends IPI interrupts to Tiles.
• SDN Out – MiCA to memory read requests and memory write requests and write data.
• RDN In – Memory read data and write acks to MiCA.
11.2.1.2 MMIO Registers and Context State
Tile access and control of the operation of the MiCA is provided via a set of memory mapped registers. Tiles access the registers via MMIO writes and reads to set up operations and check status.
Address Space
The MiCA physical address is partitioned as described in Table 11-1 and illustrated in Figure 11-2
(note that the MiCA is selected by its x/y mesh coordinate).
Bits 25:24 (Register Partition Select)   Usage of Bits 23:3 by Partition                  Bits 2:0 (Byte Offset)
10 = Context User Space                  23:14 Context Number, 13:3 Register Number       Must be 0 (see footnote 1)
11 = Context System Space                23:14 Context Number, 13:3 Register Number       Must be 0 (see footnote 1)
00 = Global Space                        23:3 Register Number                             Must be 0 (see footnote 1)
01 = Engine Access Space                 23:18 Engine Number, 17:3 Register Number        Must be 0 (see footnote 1)
Figure 11-2: MiCA Physical Address
Note: The specific Hypertext links provided in the text that follows are to the compression instance
of MiCA registers. There is a corresponding register in the crypto instance of the MiCA
registers.
Context Registers
The specific number of Contexts supported is defined by each MiCA implementation. Each Context has two distinct sets of registers for which different protection levels can be assigned,
typically for User and System space.
Source Descriptor
The Source Descriptor is defined in “Source Data” on page 181.
1. The Byte Offset is zero, because only 8-byte accesses are allowed.
Table 11-1. MiCA Physical Address
Bits    Name                 Description
25:24   Register Partition   Selects the register partition:
                             10 = Context User Space
                             11 = Context System Space
                             00 = Global Space
                             01 = Engine Access Space
                             Note that the partitions are spaced 16MB apart. This provides access control; if the entire user partition is mapped into a 16MB page, that page will not also include any system partition registers.
Context User Space and Context System Space
23:14   Context Number       A 10-bit block allows for up to 1k Contexts (the actual number is configurable per design instance of a given MiCA instantiation; a typical usage is to select that number based on the number of Tiles in a given chip). Each Context is aligned to a 16kB address boundary; this allows each Context to be protected via Tile memory management on a 16kB page or four Contexts to be grouped in a 64kB page.
13:3    Register Number      These 11 bits allow for 2k registers, but many fewer are defined.
                             Note: Writes to unused addresses are dropped; reads from unused addresses return 0x0.
2:0     Byte Offset          All registers are 8 bytes and must be written via 8-byte store instructions and read via 8-byte load instructions.
Global Space
23:3    Register Number      These 21 bits allow for 2M registers, but many fewer are defined.
                             Note: Writes to unused addresses are dropped; reads from unused addresses return 0x0.
2:0     Byte Offset          All registers are 8 bytes and must be written via 8-byte store instructions and read via 8-byte load instructions.
Engine Access Space
23:18   Engine Number        These six bits allow for up to 64 Engines. The actual number is configurable per design instance of a given MiCA instantiation.
17:3    Register Number      These 15 bits allow for 32k registers, but many fewer are defined.
                             Note: Writes to unused addresses are dropped; reads from unused addresses return 0x0.
2:0     Byte Offset          All registers are 8 bytes and must be written via 8-byte store instructions and read via 8-byte load instructions.
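The bit layout in Table 11-1 can be sketched as a set of offset calculations. This is a minimal illustration: the function and enum names are hypothetical, and the result is only the 26-bit offset within the MiCA's address range (selecting the MiCA itself by mesh coordinate is not shown).

```c
#include <assert.h>
#include <stdint.h>

/* Register Partition values, bits 25:24 of the MiCA physical address. */
enum mica_partition {
    MICA_PART_GLOBAL   = 0, /* 00 */
    MICA_PART_ENGINE   = 1, /* 01 */
    MICA_PART_CTX_USER = 2, /* 10 */
    MICA_PART_CTX_SYS  = 3  /* 11 */
};

/* Context User or Context System space: 23:14 Context, 13:3 Register. */
static uint32_t mica_ctx_offset(enum mica_partition part, unsigned ctx, unsigned reg)
{
    assert(part == MICA_PART_CTX_USER || part == MICA_PART_CTX_SYS);
    assert(ctx < (1u << 10) && reg < (1u << 11));
    return ((uint32_t)part << 24) | ((uint32_t)ctx << 14) | ((uint32_t)reg << 3);
}

/* Global space: 23:3 Register. */
static uint32_t mica_global_offset(uint32_t reg)
{
    assert(reg < (1u << 21));
    return ((uint32_t)MICA_PART_GLOBAL << 24) | (reg << 3);
}

/* Engine Access space: 23:18 Engine, 17:3 Register. */
static uint32_t mica_engine_offset(unsigned engine, unsigned reg)
{
    assert(engine < (1u << 6) && reg < (1u << 15));
    return ((uint32_t)MICA_PART_ENGINE << 24) | ((uint32_t)engine << 18) |
           ((uint32_t)reg << 3);
}
```

Bits 2:0 are always zero in the result because every register is accessed as a full 8-byte word.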
Destination Descriptor
The Destination Descriptor is defined in “Destination Data” on page 182.
Extra Data Pointer
The Extra Data Pointer is optional, depending on the operation being performed. It is defined
in MICA_COMP_CTX_USER_EXTRA_DATA_PTR.
Operation Length
The Operation Length is defined in “Destination Data” on page 182. See also
MICA_COMP_CTX_USER_OPCODE.
OPCODE
OPCODE consists of the following fields. Refer to MICA_COMP_CTX_USER_OPCODE.
• Engine Type for the requested operation. The ENGINE_TYPE field defines which type of Engine in the MiCA should perform the operation. Types 0 and 1 are common for all accelerators; Types 2 through 7 are defined uniquely for each accelerator.
• Source Mode
    • Single Buffer Descriptor
    • List of eDMA Descriptors
• Destination Mode
    • Single Buffer Descriptor
    • List of Buffer Descriptors. If this mode is selected, the number of Buffer Descriptors in the list is also specified.
    • Overwrite Source Data
• Extra Data Size (number of 8-byte words)
• Destination Size. This field is dependent on the engine type, and specifies the destination size as a function of source size.
• The MICA_COMP_CTX_USER_OPCODE can also contain some fields specific to a given MiCA implementation, based on its capabilities. If so, they are described with the engine-specific information (for example in Chapter 12: Cryptographic Accelerator Interface).
Note: Operation length and opcode are packed into one register; this allows a complete operation to be specified in four MMIO writes (versus five).
• In_Use – reads the value from COMP_PENDING bit of the MICA_COMP_CTX_USER_CONTEXT_STATUS register.
    • Allows User to poll for completion instead of receiving an interrupt.
• User Status
    • Number of bytes written to destination.
    • Error status bits.
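The four-write start sequence and completion polling described above can be sketched against a mock register file. This is an illustration only: the register indices, the COMP_PENDING bit position, and the names mock_ctx_t, mica_start, and mica_done are hypothetical, and a real implementation would issue 8-byte MMIO stores to the mapped Context User registers rather than write to an array. The opcode is written last because writing it is what starts the operation.

```c
#include <stdint.h>

/* Hypothetical register indices; not taken from the register reference. */
enum {
    REG_SRC_DATA, REG_DEST_DATA, REG_EXTRA_DATA_PTR, REG_OPCODE,
    REG_CONTEXT_STATUS, REG_COUNT
};
#define COMP_PENDING_BIT (1ull << 0) /* hypothetical bit position */

/* Mock of one Context's User registers, for illustration and testing. */
typedef struct {
    volatile uint64_t regs[REG_COUNT];
    unsigned writes;  /* number of MMIO writes issued */
    int last_reg;     /* most recently written register */
} mock_ctx_t;

typedef struct {
    uint64_t src_desc;      /* Source Descriptor */
    uint64_t dest_desc;     /* Destination Descriptor */
    uint64_t extra_data_va; /* Extra Data Pointer (optional) */
    uint64_t opcode;        /* Operation Length and OPCODE, packed */
} mica_request_t;

static void mmio_write64(mock_ctx_t *c, int reg, uint64_t value)
{
    c->regs[reg] = value;
    c->writes++;
    c->last_reg = reg;
}

/* Specify a complete operation in four 8-byte writes, opcode last. */
static void mica_start(mock_ctx_t *c, const mica_request_t *r)
{
    mmio_write64(c, REG_SRC_DATA, r->src_desc);
    mmio_write64(c, REG_DEST_DATA, r->dest_desc);
    mmio_write64(c, REG_EXTRA_DATA_PTR, r->extra_data_va);
    mmio_write64(c, REG_OPCODE, r->opcode);
}

/* Poll for completion instead of waiting for the interrupt. */
static int mica_done(const mock_ctx_t *c)
{
    return (c->regs[REG_CONTEXT_STATUS] & COMP_PENDING_BIT) != 0;
}
```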
Context System Registers
• Completion Interrupt Binding register (MICA_COMP_CTX_SYS_COMP_INT).
• TLB Miss Interrupt Binding register (MICA_COMP_CTX_SYS_TLB_MISS_INT).
• Interrupt Mask registers:
    • Interrupt Mask register (MICA_COMP_CTX_SYS_INT_MASK)
    • Interrupt Mask Set register (MICA_COMP_CTX_SYS_INT_MASK_SET)
    • Interrupt Mask Reset register (MICA_COMP_CTX_SYS_INT_MASK_RESET).
• Miss Virtual Address (VA) register (MICA_COMP_CTX_SYS_MISS_VA).
• TLB Table (16 entries per Context) register (MICA_COMP_CTX_SYS_TLB_TABLE).
• Probe VA register (MICA_COMP_CTX_SYS_PROBE_VA).
• TLB Probe Status register (MICA_COMP_CTX_SYS_PROBE_STATUS).
• Control register (MICA_COMP_CTX_SYS_CONTROL).
• System Status register (MICA_COMP_CTX_USER_CONTEXT_STATUS).
Figure 11-3 shows the set of states that a Context can be in, as well as the transitions between the
states.
[State diagram showing the states IDLE, RUN WAIT, RUN, RESET WAIT, PAUSE IDLE, and PAUSE RW, and the transitions between them.]
Figure 11-3: Context States
Legend
State        Description
IDLE         No operation in progress. Context User registers may be written to set up a new operation.
RUN WAIT     Context User registers have been written; operation is waiting to be assigned to an engine.
RUN          Operation has been assigned to an engine and is running. This is the only state in which the Context will access the TLB and initiate memory accesses.
RESET WAIT   Control register RESET bit was written to 1; engine is waiting for in-flight memory accesses to complete.
PAUSE IDLE   Control register PAUSE bit was written to 1; no operation has been requested.
PAUSE RW     Control register PAUSE bit was written to 1; there is also an operation requested.
Global Registers
Global registers refer to those registers in the Global Register Space, as illustrated in Figure 11-1
on page 176. These registers are common to all I/O devices and are described in “Device Discovery” on page 3.
11.2.1.3 Operand Data Specification
Each MiCA operation reads Source Data, operates on it, and then writes it out as Destination
Data. It can optionally read and/or write Extra Data. The next sections define the details of the
operands.
Extra Data
The MiCA architecture allows users to specify Extra Data that will be needed by an operation.
Some operations done in MiCA might not need any Extra Data. For example, encryption keys are
specified in Extra Data; a memory-to-memory copy does not use any Extra Data. When used, the
Extra Data is specified by its Virtual Address in the Extra Data Ptr (MICA_COMP_CTX_USER_EXTRA_DATA_PTR) register and a length, which is contained in the MICA_COMP_CTX_USER_OPCODE
register. The length is defined as number of 8-byte words, which must be padded with zeroes if
the actual data used is not a multiple of 8-byte words.
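The length rule for Extra Data can be sketched as follows. This is a minimal illustration: extra_data_words and extra_data_pack are hypothetical names, and placing the zero padding after the data (rather than before it) is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of 8-byte words needed to hold len_bytes of Extra Data. */
static size_t extra_data_words(size_t len_bytes)
{
    return (len_bytes + 7u) / 8u;
}

/*
 * Copy Extra Data into dst and zero-fill up to the next 8-byte boundary.
 * Returns the word count to program as the Extra Data Size, or 0 if the
 * padded data does not fit in dst_capacity bytes (or len_bytes is 0).
 */
static size_t extra_data_pack(uint8_t *dst, size_t dst_capacity,
                              const void *src, size_t len_bytes)
{
    size_t words = extra_data_words(len_bytes);
    if (words * 8u > dst_capacity)
        return 0;
    memset(dst, 0, words * 8u);   /* padding bytes must be zero */
    memcpy(dst, src, len_bytes);
    return words;
}
```

For example, a 20-byte key occupies three 8-byte words, with the last four bytes zero.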
Source Data
Source data can be specified as either a Multicore Programmable Intelligent Packet Engine
(mPIPE™) Buffer Descriptor (refer to Section 6.2.2 Buffers on page 94), or a list of mPIPE eDMA
Descriptors (refer to the “eDMA Descriptor Format” section), as determined by the SRC_MODE
field of the MICA_COMP_CTX_USER_OPCODE register.
• 0 = Single Buffer Descriptor. Source Data register (MICA_COMP_CTX_USER_SRC_DATA) contains the Buffer Descriptor bitfield (BUFFER_DESC), which can be either chained or unchained. Note that the MICA_COMP_CTX_USER_OPCODE register SIZE field specifies the total source data length.
• 1 = List of eDMA Descriptors. Source Data register contains a VA pointer (VA) to a list of mPIPE eDMA Descriptors. The maximum size of the list is four eDMA Descriptors, and the pointer must be cacheline-aligned.
eDMA Descriptor Format
The eDMA Descriptor format is the same as the one mPIPE uses, although some fields used by
mPIPE do not apply to the MiCA and are ignored. See “eDMA Packet Descriptors” on page 124
for more information. MiCA uses only the Size, Bound (Boundary bit), and Buffer Descriptor fields. Refer to Figure 6-3 on page 95 for more information on these fields.
The MiCA Source Mode can be set to either a single buffer descriptor or a list of eDMA descriptors. Figure 11-4 on page 182 shows
the case where the Source Mode is a list of eDMA descriptors. Each eDMA descriptor in the list
contains two sub-fields: a Size field and a Buffer Descriptor field. There can be up to four eDMA
descriptors in the list. In the example depicted in Figure 11-4, the first eDMA Descriptor points to
a buffer chain (refer to 6.2.2.2 Buffer Chaining). The other three eDMA Descriptors can
point to different buffer chains. The Source Descriptor (Src Desc) should be assigned the VA of
the list (array) of up to four eDMA Descriptors.
MiCA can process one, two, three, or four eDMA Descriptors in one operation (note that processing a single eDMA Descriptor is not illegal, but can be performed more efficiently using the Single
Buffer Descriptor source mode). If the list has four eDMA Descriptors, the Boundary bit is
implied as set on the fourth descriptor. Having a Size of 0 in any of the eDMA Descriptors
results in an error.
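The list rules above can be sketched as a validation helper. This is a software model, not the hardware descriptor layout: the edma_desc_model_t structure and the function name are hypothetical, and the sketch assumes that the Boundary bit marks the last descriptor of the list, which is what the implied Boundary bit on the fourth descriptor suggests.

```c
#include <stdint.h>

#define MICA_MAX_EDMA_DESCS 4

/* Model of the three eDMA Descriptor fields that the MiCA uses. */
typedef struct {
    uint32_t size;        /* Size field; 0 is an error */
    int      bound;       /* Boundary bit */
    uint64_t buffer_desc; /* Buffer Descriptor (opaque here) */
} edma_desc_model_t;

/*
 * Return the number of descriptors (1 to 4) that one operation would
 * process, or -1 if a descriptor with a Size of 0 is encountered first.
 * The list must have room for MICA_MAX_EDMA_DESCS entries.
 */
static int mica_edma_list_count(const edma_desc_model_t *list)
{
    for (int i = 0; i < MICA_MAX_EDMA_DESCS; i++) {
        if (list[i].size == 0)
            return -1;
        /* The Boundary bit is implied on the fourth descriptor. */
        if (list[i].bound || i == MICA_MAX_EDMA_DESCS - 1)
            return i + 1;
    }
    return -1; /* not reached */
}
```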
[Diagram. The Src Desc entry of the MiCA Context User Registers points to a list of eDMA Descriptors in memory. Each eDMA Descriptor holds a Size (and other fields) and a Buffer Descriptor; the Buffer Descriptor points to a buffer chain, Buffer-0 through Buffer-n.]
Figure 11-4: Using a List of eDMA Descriptors as a MiCA Source Mode (see footnote 1)
Fields of the Buffer Descriptor used by the mPIPE that do not apply and are ignored are:
• Gen – Generation number.
• CSUM – Checksum generation enabled.
• NS – NoSend.
• CSUM_START – Start byte of checksum.
• CSUM_DEST – Destination of checksum.
• Notif – Notification interrupt.
• StackIDX – 52:48. MiCA does not manage stacks of buffers as mPIPE does.
• Format of HWB. MiCA does not release buffers as mPIPE does.
Unused fields must be 0.
For more information about the Buffer Descriptor, refer to “iDMA Packet Descriptors” on page 100.
Destination Data
Destination data can be specified as either a mPIPE Buffer Descriptor (refer to Section 6.2.2 Buffers
on page 94) or a list of mPIPE Buffer Descriptors, as determined by DST_MODE field of the MICA_COMP_CTX_USER_OPCODE register.
• 0 = Single Buffer Descriptor. Destination Data register (MICA_COMP_CTX_USER_DEST_DATA) contains the Buffer Descriptor field (BUFFER_DESC). The descriptor’s Chain field can be either unchained or chained, but the buffer must be large enough to hold all the destination data; specifying a chain of buffers is illegal.
• 1 = Overwrite Source Buffers. The MICA_COMP_CTX_USER_DEST_DATA register is not used; the destination data is written into the Source Buffers.
1. See also 6.5.2 eDMA Packet Descriptors.
• 2 = List of Buffer Descriptors. MICA_COMP_CTX_USER_DEST_DATA register contains a VA pointer to a list of Buffer Descriptors. The number of descriptors in the list is specified in the DEST_MODE bitfield of the MICA_COMP_CTX_USER_OPCODE register, and has a maximum size of 32. The pointer must be cacheline-aligned.
MiCA treats each buffer descriptor in the list as a single buffer only. If the Chain field is
Chained, the Size field specifies the size of the buffer and the next descriptor is ignored. If
the field is Unchained, then the Size field is ignored.
Note that the Operation Length register (MICA_COMP_CTX_USER_OPCODE) specifies the total
source data length. The destination data length is determined by the operation, and can be equal
to, greater than, or less than the source data length. If it is greater than the source data length, the
DEST_SIZE field of the MICA_COMP_CTX_USER_OPCODE register indicates how much additional
space is available.
Figure 11-5 shows the case where the Destination Mode (Dest Mode) is set to a list of buffer
descriptors. Dest Desc should be assigned the VA of a list (array) of up to 32 Buffer Descriptors, as
specified by the number of destination buffer descriptors (NUM_DEST_BD). Each Buffer Descriptor
points to a corresponding buffer in the chain.
[Diagram. The Dest Desc entry of the MiCA Context User Registers points to a list in memory of Buffer Descriptors 0 through 31, each with C=1. Each Buffer Descriptor points to its own buffer, Buffer-0 through Buffer-31.]
Figure 11-5: Using a List of Buffer Descriptors as a MiCA Destination Mode
The array can contain up to 32 buffers. The status register will report the number of destination
data bytes written, from which the number of used buffers can be determined.
The chaining field in the Buffer Descriptors must be set to 1. Otherwise, the MiCA output will be
written to the corresponding buffer irrespective of the size specified in the Buffer Descriptor.
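The number of used destination buffers can be derived from the reported byte count roughly as follows. This is a sketch under an assumption that this section does not state explicitly: buffers are filled in list order, and each buffer is filled completely before the next one is used. The function name is hypothetical.

```c
#include <stdint.h>

#define MICA_MAX_DEST_BUFFERS 32

/*
 * Given the Size of each destination buffer in list order and the number
 * of destination bytes reported in the status register, return how many
 * buffers hold output data. Returns -1 if the list is too long or the
 * byte count exceeds the total capacity of the list.
 */
static int mica_dest_buffers_used(const uint32_t *sizes, int count,
                                  uint64_t bytes_written)
{
    uint64_t capacity = 0;
    int i;

    if (count < 0 || count > MICA_MAX_DEST_BUFFERS)
        return -1;
    for (i = 0; i < count && capacity < bytes_written; i++)
        capacity += sizes[i];
    return (capacity >= bytes_written) ? i : -1;
}
```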
11.2.1.4 TLB (Translation Lookaside Buffer)
The TLB is used to store VA-to-PA (Virtual Address-to-Physical Address) translations. It is partitioned per Context, with each Context having 16 entries. Tiles write to and read from the TLB and
can initiate probes to it. The MiCA performs lookups in the TLB.
• MMIO accesses – Tiles read and write entries in the TLB.
• Translation lookups are done by MiCA at operation startup and when data buffers cross page boundaries.
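A per-Context lookup can be modeled in software as shown below. This is a conceptual model only: the entry structure is hypothetical and does not reflect the format of the TLB Table register or the TLB Attributes. A return value of -1 corresponds to the case in which the hardware would record the missing VA and raise a TLB Miss interrupt.

```c
#include <stdint.h>

#define MICA_TLB_ENTRIES 16 /* entries per Context */

/* Hypothetical model of one translation; not the register format. */
typedef struct {
    int      valid;
    uint64_t va;   /* base virtual address of the page */
    uint64_t pa;   /* base physical address of the page */
    uint64_t size; /* page size in bytes */
} tlb_entry_model_t;

/*
 * Translate va using one Context's 16-entry TLB. On a hit, store the
 * physical address in *pa and return the matching entry index. On a
 * miss, return -1 and leave *pa unchanged.
 */
static int tlb_lookup(const tlb_entry_model_t *tlb, uint64_t va, uint64_t *pa)
{
    for (int i = 0; i < MICA_TLB_ENTRIES; i++) {
        const tlb_entry_model_t *e = &tlb[i];
        if (e->valid && va >= e->va && va - e->va < e->size) {
            *pa = e->pa + (va - e->va);
            return i;
        }
    }
    return -1;
}
```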
11.2.1.5 Engine Scheduler
Scheduling consists of assigning Contexts, which have operations to perform, to hardware
resources to perform them (for example, the Function Specific Engines). This function is necessary
because there are many more Contexts than Engines. Once an Engine is scheduled to a Context, it
runs the operation to completion; an Engine is not time-shared within an operation.
11.2.1.6 Function Specific Engines
There are multiple Processing Engines, each capable of performing a given algorithm (for
example, for Crypto: AES, DES, MD5, SHA, and so on). The exact list and number of instances of each
type is specific to each MiCA implementation, and typically is much lower than the number of
Contexts.
11.2.1.7 DMA Channels
The DMA Channels move data between memory and the Function Specific Engines.
Egress DMA is for reading data from memory (packet data and related operating parameters),
and Ingress DMA is for writing packet data to memory (note that this convention is the same as
for the mPIPE, where Ingress packets travel from external interface into memory, and Egress
packets travel from memory to the external interface).
Each Engine has dedicated DMA Egress and Ingress channels assigned to it, so that no Engine is
blocked by any other.
11.2.1.8 PA to Header Generation
This block takes the physical address and page attributes from a DMA channel and converts that
into a route header to pass to the Share Dynamic Network (SDN) Out mesh interface.
11.3 Operation Flow
This section describes the high-level flow of an operation through the MiCA, followed by some
specific operation details.
11.3.1 General Flow
1. Tile software or hardware puts source data in memory. For example, the data could be a packet received by the mPIPE.
2. Tile software allocates memory for destination data.
3. Tile software puts extra data, if needed, in memory.
4. Tile software writes parameters describing the operation into its allocated Context Registers.
5. The Context requests use of an Engine from the Scheduler.
6. When an Engine is available, the Scheduler assigns a waiting Context to it.
7. The Engine reads operation parameters from the Context’s registers.
8. The Engine accesses data from memory, in the following order:
    a. If the Source Mode is a List of eDMA Descriptors, read the list.
    b. If the Destination Mode is a List of Buffer Descriptors, read the list.
    c. If Extra Data is required, read the Extra Data.
    d. Read Source Data. For more information refer to 6.2.2 Buffers on page 94.
9. Source data is processed (for example, encrypted/decrypted, compressed/decompressed, etc.), and output is written to Destination.
Note: The Engine performs TLB lookups to translate VAs, as needed. A TLB miss generates an IPI
interrupt to the Tile (if not masked). Only one TLB Miss will occur at a time (subsequent
lookups do not happen under a miss).
The Engine overlaps the next TLB lookup with the current data access.
The Engine continues reading source data and writing destination data until the operation is complete.
10. The MiCA sets the COMP_PENDING bitfield of the MICA_COMP_CTX_USER_CONTEXT_STATUS register and
sends an IPI interrupt (if not masked). The interrupt also acts as a memory fence; it is not sent
until the destination data is visible in memory.
Note that Contexts request specific types of operations by specifying the Engine Type (ENGINE_TYPE field) in the MICA_COMP_CTX_USER_OPCODE register and the Scheduler assigns the specific
engine (step 5 above). This allows for a MiCA implementation to include multiple copies of a
given engine for higher processing throughput. Each engine also has a number, by which it can be
accessed for system operations — for example, performance monitoring, reset, etc. In general the
specific engine number does not correlate to Engine Type. The Engine Number is not significant
to the Context.
11.3.2 Tile Interrupts
Interrupts can be sent for the following reasons:
• TLB Misses
• Normal Completion
Each of the interrupt types has its own Bound Tile, as specified in Context System registers. Each
also has a Pending bit in the Context System MICA_COMP_CTX_USER_CONTEXT_STATUS register,
and a MASK bit in the Context System Interrupt Mask register (MICA_COMP_CTX_SYS_INT_MASK).
This combination allows interrupts to be polled, and also to be temporarily deferred. The operation is:
1. The appropriate Pending bit (TLB_MISS_PENDING or COMP_PENDING) is set during the course of the operation, for example if a TLB Miss occurs or the operation completes.
    a. When the Pending bit, either the COMP_PENDING or TLB_MISS_PENDING bit, is a 1 and the associated Interrupt Mask bit is 0, an IPI interrupt will be sent to the Bound Tile. Note that the IPI is sent via the Response Dynamic Network (RDN), which is also used for MMIO read responses and MMIO write acks. This means that an MMIO response will be ordered behind an earlier IPI that had been sent.
    b. If the MASK bit is a 1 the IPI interrupt will not be sent.
    c. If the MASK bit is written from 1 to 0 the pending interrupt will be sent.
The MASK bits can be written directly at the Interrupt Mask register (MICA_COMP_CTX_SYS_INT_MASK) address or set/reset by writing a 1 to the Interrupt Mask Set
(MICA_COMP_CTX_SYS_INT_MASK_SET)/Interrupt Mask Reset register (MICA_COMP_CTX_SYS_INT_MASK_RESET) address, respectively.
The interrupt types are independent, but there are some dependencies:
• Normal Completion cannot happen if there is a TLB Miss outstanding. The operation cannot complete without the required translation.
Note: Strictly speaking, if the TLB Miss is prefetching the next translation, it is not required
for the operation. However, the TLB Miss must still be acknowledged in order for the
operation to complete.
• There will only be one TLB Miss outstanding at a time (there is only one MICA_COMP_CTX_SYS_MISS_VA register per Context).
• After a Completion interrupt occurs, there will be no subsequent TLB Miss interrupts.
2. The Tile must dismiss the interrupt. This can be done by writing a 1 to the interrupt’s Pending
bit in the Context System Register. Alternatively, all of the Pending bits are cleared by writing
the MICA_COMP_CTX_USER_OPCODE register to start the next operation (this in effect implicitly acknowledges the operation completion, and saves the MMIO write that would otherwise be needed
to dismiss the Completion interrupt). The TLB Miss interrupt Pending bit is also cleared by filling
a TLB entry when the TLB Miss Ack bit (TLB_MISS_ACK) value is 1 in the write of the entry’s
TLB Attributes (this also implicitly clears the TLB Miss interrupt bitfield (MICA_COMP_CTX_SYS_TLB_MISS_INT); refer to “TLB Miss”).
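The Pending and MASK behavior described in this section can be illustrated with a small behavioral model. This is only a sketch in Python, not driver code: the class name, method names, and the use of strings for the two interrupt types are assumptions made for the example, and no real register addresses are involved.

```python
from dataclasses import dataclass, field

COMP = "COMP_PENDING"
TLB_MISS = "TLB_MISS_PENDING"


@dataclass
class ContextInterrupts:
    """Toy model of one Context's Pending and MASK bits (not real hardware)."""
    pending: set = field(default_factory=set)
    masked: set = field(default_factory=set)
    sent: list = field(default_factory=list)   # IPIs delivered to the Bound Tile

    def _deliver(self, kind):
        # An IPI goes out only when the Pending bit is 1 and the MASK bit is 0.
        if kind in self.pending and kind not in self.masked:
            self.sent.append(kind)

    def raise_event(self, kind):
        self.pending.add(kind)
        self._deliver(kind)

    def set_mask(self, kind):
        self.masked.add(kind)

    def clear_mask(self, kind):
        # A 1 -> 0 transition of the MASK bit releases a pending interrupt.
        was_masked = kind in self.masked
        self.masked.discard(kind)
        if was_masked:
            self._deliver(kind)

    def dismiss(self, kind):
        # Models writing a 1 to the Pending bit.
        self.pending.discard(kind)


ctx = ContextInterrupts()
ctx.set_mask(COMP)
ctx.raise_event(COMP)              # masked: stays pending, nothing is sent
sent_while_masked = list(ctx.sent)
ctx.clear_mask(COMP)               # the deferred IPI is sent now
```

With the mask set, software could equally poll `ctx.pending` and never take the interrupt, which is the polling use mentioned above.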
11.3.3 Specific Use Examples
The next sections give some examples of specific uses. This list is not exhaustive; other uses not
covered here are also possible.
11.3.3.1 General Use
Contexts are assigned to Tile User level processes; the mechanism for doing that is beyond the
scope of this discussion. It is assumed that as part of the setup process the Context User Registers
for the assigned Context are mapped into the User’s virtual memory space and the Context System Completion Interrupt Binding register (MICA_COMP_CTX_SYS_COMP_INT) is set to interrupt
the User process. The User can then directly initiate operations and receive Completion Interrupts. The TLB Miss Interrupt Binding is set to interrupt System-level software.
An operation is initialized by writing the operation parameters into the Context User registers.
The write of the MICA_COMP_CTX_USER_OPCODE register triggers the operation to start.
Note: Once the MICA_COMP_CTX_USER_OPCODE register is written, the operation parameters
should not be written again until the operation completes, as determined by the Context
sending a completion interrupt.
The parameters are Source and Destination Descriptors, a virtual address pointer to extra data (if
needed, for example Crypto parameters like keys, etc.), and the opcode and length of the
operation.
Normal completion is reported by a Completion Interrupt. When the destination data size is
known beforehand, the User Process does not need to read any status from the Context in response
to a Completion Interrupt. When the destination data size is not known, the User Process reads the
MICA_CONTEXT_STATUS register to get that information. After it sends the Completion Interrupt,
the Context is ready to accept the next operation. Note that completion can also be determined by
polling, if desired.
TLB Misses are normally bound to System-level software and are transparent to the User level.
11.3.3.2 TLB Miss
During normal processing operations, the Engine looks up VA-to-PA translations in the associated Context’s partition of the TLB (note that the TLB is partitioned such that each Context has its
own set of entries). If a translation is not found, it will set the Context’s TLB Miss Interrupt Pending status bit. Normally the TLB Miss interrupt will not be masked and therefore an interrupt will
be sent to the Bound Tile. Alternatively, software can mask the interrupt and poll the TLB Miss
Interrupt Pending bit (TLB_MISS_PENDING). In either case the software follows the steps
described below in response to a TLB Miss.
1. Read the Context’s MICA_COMP_CTX_SYS_MISS_VA register, to find the VA that was not
found in TLB, and also a suggested TLB Index to load with the new translation.
2. Access the page table to find the appropriate PA and Attributes. This task is software-dependent, and not in the scope of this document.
3. Write the Context’s TLB Address register for the entry being replaced. This write clears the
Valid bit for the entry (the entry spans two registers, so this action makes the update free of
races).
4. Write the Context’s TLB Attributes Register for the entry being replaced. Writing this register
with TLB Miss Ack bit (TLB_MISS_ACK) = 1 also clears the TLB Miss Pending Interrupt bit
(TLB_MISS_PENDING), which implicitly dismisses the TLB Miss interrupt.
Note that translations are done ahead of time, in parallel with data processing, so most times
the operation will not be stalled waiting on the translation. If the operation did become
stalled, dismissing the interrupt will unstall it. The Engine does not swap to another Context
operation when stalled, so TLB misses should be dismissed quickly to minimize lost Engine
throughput.
5. After the TLB is filled, the Engine will repeat the lookup that missed.
If for some reason it is not possible for software to complete the TLB fill, the operation must be
terminated as described in “Terminate Operation for a Specific Context” on page 188.
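The five steps above can be sketched as a behavioral model. Everything here is illustrative: the `ContextTlb` class, its method names, the four-entry size, and the choice of index 0 as the suggested replacement are assumptions for the example, and the model assumes that the attributes write carries a Valid bit of 1.

```python
class ContextTlb:
    """Toy model of a Context's TLB partition and miss state (illustrative only)."""

    def __init__(self, entries=4):
        self.entries = [{"va": None, "pa": None, "valid": False}
                        for _ in range(entries)]
        self.miss_va = None
        self.miss_index = None
        self.tlb_miss_pending = False

    def lookup(self, va):
        for entry in self.entries:
            if entry["valid"] and entry["va"] == va:
                return entry["pa"]
        # Miss: record the VA and a suggested index, then set the Pending bit.
        self.miss_va = va
        self.miss_index = 0          # replacement choice is arbitrary here
        self.tlb_miss_pending = True
        return None

    def write_tlb_address(self, index, va):
        # Step 3: writing the address half clears Valid, so the entry is
        # never visible in a half-updated state.
        self.entries[index].update(va=va, valid=False)

    def write_tlb_attributes(self, index, pa, tlb_miss_ack):
        # Step 4: the attributes write validates the entry; with the ack bit
        # set it also clears the Pending bit.
        self.entries[index].update(pa=pa, valid=True)
        if tlb_miss_ack:
            self.tlb_miss_pending = False


def service_tlb_miss(tlb, page_table):
    va, index = tlb.miss_va, tlb.miss_index        # step 1
    pa = page_table[va]                            # step 2
    tlb.write_tlb_address(index, va)               # step 3
    tlb.write_tlb_attributes(index, pa, True)      # step 4


tlb = ContextTlb()
page_table = {0x1000: 0x8000}
first = tlb.lookup(0x1000)         # misses
service_tlb_miss(tlb, page_table)
second = tlb.lookup(0x1000)        # step 5: the repeated lookup now hits
```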
11.3.3.3 Deferred Interrupts
When a User Process is swapped out of a Tile, it is not available to receive interrupts. In this
case the User Process still wants the operation to proceed (so the operation is not terminated
or stalled); only delivery of the Completion Interrupt is deferred.
In some instances, System software might need to defer TLB Miss Interrupts.
The MICA_COMP_CTX_SYS_INT_MASK register provides this capability. The
Mask Register has three alias addresses – Interrupt Mask (MICA_COMP_CTX_SYS_INT_MASK),
Interrupt Mask Set (MICA_COMP_CTX_SYS_INT_MASK_SET), and Interrupt Mask Reset (MICA_COMP_CTX_SYS_INT_MASK_RESET). Using the set and reset addresses eliminates the need to do a
read-modify-write to the register.
The hardware keeps masked interrupts in a pending state until it can deliver them when the mask
is cleared, as described in Section 11.3.2 Tile Interrupts.
Note that if the operation completes while the Completion Interrupt is masked, the Engine is
freed up for a new operation. However, if a TLB Miss occurs while TLB Miss Interrupt is masked,
the Engine still waits for the TLB Miss to be serviced (that is, the Engine does not swap to
another Context operation). In general, software should be aware that deferring TLB Miss service
can have a negative impact on Engine throughput.
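The benefit of the Set and Reset aliases can be seen in a small model. This is a sketch only: the class and the two bit positions are hypothetical and were chosen for the example, not taken from the register definitions.

```python
class IntMaskRegister:
    """Toy model of a mask register with Set and Reset alias addresses."""

    def __init__(self, value=0):
        self.value = value

    def write(self, value):        # plain address: replaces every bit
        self.value = value

    def write_set(self, bits):     # Set alias: bits written as 1 become masked
        self.value |= bits

    def write_reset(self, bits):   # Reset alias: bits written as 1 become unmasked
        self.value &= ~bits


# Hypothetical bit positions, chosen only for this example.
COMP_BIT = 1 << 0
TLB_MISS_BIT = 1 << 1

mask = IntMaskRegister()
mask.write_set(COMP_BIT)           # defer Completion interrupts
mask.write_set(TLB_MISS_BIT)       # defer TLB Miss interrupts as well
mask.write_reset(COMP_BIT)         # re-enable Completion; TLB Miss is untouched
```

Each alias write changes only the bits named in the written value, so two agents that manage different mask bits do not need to read the register first or coordinate with each other.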
11.3.3.4 Pause Context
The following sequence can be used by software to pause a Context for a period of time. Normally
System software will do this when it wants to remap memory. Once it is paused, no memory
accesses or TLB lookups will be initiated for the Context.
1. Write a 1 to the Pause bit in the Context’s Control register (MICA_COMP_CTX_SYS_CONTROL).
2. If the Context has not been programmed for an operation, or was programmed but not yet
assigned to an Engine, it will go to Paused state immediately.
3. Poll the Context’s Status Register state field until the Context state is Paused Idle or Paused
Run Wait. If the Context was already assigned to an Engine it will complete the operation and
then go to Paused Idle. Refer to Figure 11-3.
4. The Status register (MICA_COMP_CTX_USER_CONTEXT_STATUS) also indicates if the Context
has any pending interrupts. If it does and the interrupt was unmasked, the software must wait
for the interrupt to be delivered.
5. Write a 0 to the PAUSE bit of the MICA_COMP_CTX_SYS_CONTROL register to allow the operation to continue.
The sequence described above can be done regardless of the state of the operation, that is, the
PAUSE bit can be written whether or not an operation has been requested. If one was requested,
the PAUSE bit can be written whether it is queued waiting for an Engine, or already assigned to an
Engine. This allows the process that requests operations and the process that manages remapping
memory to operate independently.
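The polling in step 3 is typically written as a bounded loop rather than an open-ended spin. The helper below is a generic sketch: `poll_until`, the state strings, and the iterator that stands in for reads of the Status register state field are all hypothetical and do not reflect the actual field encoding.

```python
import time


def poll_until(read_state, wanted, timeout_s=1.0, interval_s=0.0):
    """Call read_state() until it returns a value in `wanted` or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = read_state()
        if state in wanted:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"context still in state {state!r}")
        time.sleep(interval_s)


# Stand-in for Status register reads: the Context is running an operation,
# finishes it, and then reports Paused Idle.
_states = iter(["RUN", "RUN", "PAUSED_IDLE"])
final_state = poll_until(lambda: next(_states),
                         {"PAUSED_IDLE", "PAUSED_RUN_WAIT"})
```

The same helper shape would fit the poll for IDLE after a Context reset.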
11.3.3.5 TLB Probe
The TLB can be probed by software to check for the presence of a given translation.
1. Write the VA to be checked to the Context’s MICA_COMP_CTX_SYS_PROBE_STATUS register.
2. Read the hit/miss status and index in the Context’s MICA_COMP_CTX_SYS_PROBE_STATUS register.
This operation can be done regardless of the Context’s state.
11.3.3.6 TLB Shootdown
When operating system software needs to move or remove pages of Physical Memory it must
coordinate with agents (including Tile and non-Tile agents) that have copies of translations in the
TLBs. The following sequence is used to coordinate removing entries from the TLB and getting an
acknowledgment that the operation has completed.
1. Pause the Context as described in “Pause Context” above. This is necessary to ensure that all
in-flight memory operations that might be using the affected pages are completed.
2. Remove the affected entries. For each page:
a. Write the VA to the Context’s Probe register.
b. Read the hit/miss status and index in the Context’s MICA_COMP_CTX_SYS_PROBE_STATUS
register (PROBE_STATUS).
c. If hit, write the Context’s TLB Attribute register (TLB_ENTRY_ATTR) of the index that hit,
writing the Valid bit (VLD) to 0.
d. An alternative to probing is to invalidate all of the Context’s TLB entries by writing each
entry’s valid bit to 0.
3. Un-pause the Context by writing a 0 to the PAUSE bit of the MICA_COMP_CTX_SYS_CONTROL register, which allows the operation to continue.
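Step 2 of the sequence above amounts to a probe-then-invalidate loop over the affected pages. The sketch below models it on a plain list of entries; the dictionary layout and function names are assumptions for illustration, and the pause and un-pause steps are left out.

```python
def probe(tlb_entries, va):
    """Return (hit, index) for `va`, mimicking a probe-status read."""
    for index, entry in enumerate(tlb_entries):
        if entry["valid"] and entry["va"] == va:
            return True, index
    return False, None


def shoot_down(tlb_entries, vas):
    """Invalidate any entry translating one of `vas`; return the indices cleared."""
    cleared = []
    for va in vas:
        hit, index = probe(tlb_entries, va)        # steps 2a and 2b
        if hit:
            tlb_entries[index]["valid"] = False    # step 2c: write VLD = 0
            cleared.append(index)
    return cleared


entries = [
    {"va": 0x1000, "valid": True},
    {"va": 0x2000, "valid": True},
    {"va": 0x3000, "valid": False},
]
cleared = shoot_down(entries, [0x2000, 0x3000, 0x4000])
```

Only the entry for 0x2000 is cleared: 0x3000 is already invalid and 0x4000 is not present. When many pages are affected, the alternative in step 2d (invalidate every entry) trades extra TLB misses later for fewer MMIO accesses now.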
11.3.3.7 Terminate Operation for a Specific Context
When a process running on a Tile is terminated prior to completion, for example due to some
error, any I/O operation associated with it should also be terminated. The following sequence is
used to terminate the operation for a Context owned by that process:
1. Write a 1 to the Context’s MICA_COMP_CTX_SYS_CONTROL register RESET bit.
2. Poll the Context’s MICA_COMP_CTX_USER_CONTEXT_STATUS register state field until the
Context state is IDLE.
a. If the Context has not been assigned to an Engine before the RESET bit of the MICA_COMP_CTX_SYS_CONTROL register is written, it will go to IDLE state immediately and not be assigned
to an Engine.
b. If the Context has been assigned to an Engine prior to RESET being written, it will terminate any operation as quickly as possible, but might need to wait for in-flight memory
operations to complete before going to IDLE state.
CHAPTER 12
CRYPTOGRAPHIC ACCELERATOR INTERFACE
This chapter describes the TILE-Gx Crypto implementation of Multicore iMesh Coprocessing
Accelerator (MiCA™).
12.1 Engines
There are five engines in Crypto MiCA:
• One Memory-to-Memory Copy Engine, described in “Memory-to-Memory Copy Engine” on
page 191
• Two Tilera Crypto Packet Processors, described in “Crypto Packet Processor” on page 192
• One Tilera KASUMI core and Tilera SNOW-3G core, described in “KASUMI and SNOW-3G
Engine” on page 193
• One Tilera Public Key Accelerator (PKA), described in “Public Key Accelerator Engine” on
page 195
12.2 Schedulers
The Memory-to-Memory Copy, Crypto Packet Processor, and KASUMI/SNOW-3G Engines are
each scheduled by their own Scheduler. All the schedulers have four priority levels. Scheduling
uses fixed priority across levels and a round-robin policy within each level; programmable timers guarantee that lower levels do not get “starved”.
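A minimal model of the selection policy is sketched below. It is an assumption-laden illustration: level 0 is taken to be the highest priority, a per-level FIFO stands in for round-robin among one-shot requests, and the programmable anti-starvation timers are not modeled because their behavior is not detailed in this section.

```python
from collections import deque


class PriorityRoundRobin:
    """Fixed priority across levels, arrival order within a level (toy model)."""

    def __init__(self, levels=4):
        self.queues = [deque() for _ in range(levels)]   # index 0 = highest

    def request(self, context_id, level):
        self.queues[level].append(context_id)

    def next_context(self):
        for queue in self.queues:          # scan from highest priority down
            if queue:
                return queue.popleft()     # oldest requester at that level
        return None                        # nothing is waiting


sched = PriorityRoundRobin()
sched.request("ctx_a", level=2)
sched.request("ctx_b", level=0)
sched.request("ctx_c", level=2)
order = [sched.next_context() for _ in range(4)]
```

In this model a steady stream of level-0 requests would hold off level 2 indefinitely, which is the starvation case the programmable timers are intended to prevent.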
12.3 Contexts
The Crypto MiCA supports forty Contexts.
12.4 Engine-Specific Details
The next sections provide unique details of each of the Engines.
12.4.1 Memory-to-Memory Copy Engine
Memory-to-Memory copy operations are specified by Engine Types 0 and 1 in the MICA_CRYPTO_CTX_USER_OPCODE register. Type 0 does a copy of source data to destination; Type 1 does a
copy of inverted source data to destination. No extra data is used for memory-to-memory copy.
The Engine number for memory-to-memory copy is 0.
12.4.1.1 Usage Constraints for the Engine
This section describes both guidelines and constraints for using the Engine.
None
12.4.2 Crypto Packet Processor
Crypto Packet Processor operations are specified by Engine Type 2 in the MICA_CRYPTO_CTX_USER_OPCODE register. Two copies of the engine (Engine Number 2 and Engine Number 3) are
included to achieve higher total throughput, and the scheduler selects which engine is assigned,
which is transparent to the Contexts requesting the operation.
Note: Any particular Internet Protocol Security (IPSec) flow needs to go through one particular
Engine.
Crypto Packet Processor performs operations on a packet through the use of a token and a Context
Record. Extra Data is used to create the Input Token and Context Record used by Crypto Packet
Processor, and to receive the Result Token created by Crypto Packet Processor.
• An Input Token is a series of commands informing the Crypto Packet Processor Engine how to
process the packet; for example, what headers to insert or remove, what encryption algorithms
to use, etc. The size of the Input Token, in number of 4-byte words, is specified in the OPCODE
register.
• The Context Record supplies information used by the Crypto Packet Processor, such as encryption keys, IVs, etc.
• The Result Token supplies status about the operation, including errors detected by Crypto
Packet Processor.
Note: The detailed description of the Input Token, Context Record, and Result Token is provided
in “Packet Processor — Programming” on page 264.
The overall flow for using Crypto Packet Processor is described below. Note, this list highlights
the unique parts of the flow, relative to the steps described in Chapter 11: Common Accelerator
Interface (MiCA), in Section 11.3.1 General Flow.
1. In the Extra Data step of the General Flow, Tile software puts the Input Token and Context
Record in memory, and also allocates space (32 bytes) for the Result Token.
2. If the operation prepends and/or appends data onto the destination packet, Tile software allocates space and sets the Destination Size field (SIZE) of the MICA_CRYPTO_CTX_USER_OPCODE
register with the appropriate value.
3. Crypto Packet Processor reads the Input Token, Context Record, and source data, and then
performs the operation.
4. After writing destination data, the Crypto Packet Processor Engine writes the updated Context Record and Result Token to the Extra Data area.
12.4.2.1 Usage Constraints for the Crypto Packet Processor Engine
This section describes both guidelines and constraints for using the Crypto Packet Processor
Engine.
• Configuration/initialization for an engine is done through MMIO access to the engine registers.
See “Inline Packet Engine” on page 371 for general information, Appendix D for the register
map, and 11.2.1.2 Overview and Major Functional Blocks of this document to map and access
the registers for a particular engine. Only configuration registers (D.9.1 Configuration Registers on page 463) should be accessed, and only before any packets are processed. The processor
comes up in a usable state, so it might not be necessary to make any changes to the configuration.
• Many of the runtime registers are accessed by the MiCA and should not be written.
• Do not confuse context record with MiCA context; they are not related.
• Tokens for various operations are provided by Tilera. The user must fill in certain fields in the
token on a per-packet basis before sending it to the Crypto Packet Processor Engine. See D.8
Context Record Definition on page 433 and “Inline Packet Engine — Token Examples” on page
491 for details.
• MiCA takes care of DMA of packets.
• See “Result Token Definition” on page 429 for a description of the Result Token.
12.4.3 KASUMI and SNOW-3G Engine
The KASUMI and SNOW-3G cores share a channel, so only one can be processing an operation at
a time. The scheduler assigns Contexts that are requesting operations from either of these cores in
a round-robin manner, which is transparent to the Contexts requesting the operation.
KASUMI operations are specified by Engine Type 4 in the MICA_CRYPTO_CTX_USER_OPCODE
register, and SNOW-3G operations are specified by Engine Type 5.
The Engine number for KASUMI/SNOW-3G is 1.
Both of these cores use OPCODE register and Extra Data as listed below. The Engine Type field is
used to determine which core is used and, therefore, how to use the Extra Data.
12.4.3.1 KASUMI Engine
Note: The Extra Data fields used by KASUMI are described in detail in “KASUMI
Engines” on page 261.
Table 12-1. OPCODE Register and Extra Data Description

Bits      Name         Description
56:53     OPCODE       0011  KASUMI Encrypt
                       0010  KASUMI Decrypt
                       0101  f8
                       1001  f9
                       All other values are illegal.

Extra Data – 4 Words

Word 0
31:0      config       This field is only used for f8 mode and f9 mode.
                       Bit [0], direction: Specifies the direction of the f8 or f9 session (uplink or downlink).
                       Bit [5:1], bearer: Specifies the Bearer value for f8 sessions.
                       Bit [15:6]: Reserved.
                       Bit [31:16], length: Specifies the length of the message data in bits during f9 mode.
                       The total message length can be up to 2^16 = 65536 bits.
63:32     Reserved

Word 1
31:0      count        This field is only used during f8 mode and f9 mode. This is the 32-bit count value.
63:32     fresh data   This field is only used during f9 mode. This is the 32-bit fresh value.

Word 2
63:0      key          Bits [63:0] of the 128-bit key.

Word 3
127:64    key          Bits [127:64] of the 128-bit key.

Note: For all operations except f9, the destination data size is the same as source data size. For f9 the
destination data size is 4 bytes.
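As an illustration of the config field layout in Table 12-1, the sketch below packs the direction, bearer, and length sub-fields into a 32-bit value. The function name and its range checks are assumptions for the example. A 16-bit field holds at most 65535, and how a 65536-bit message would be encoded is not stated in the table, so the sketch rejects it.

```python
def kasumi_config_word(direction, bearer=0, length_bits=0):
    """Pack the 32-bit config field of Extra Data Word 0 (per Table 12-1)."""
    if direction not in (0, 1):
        raise ValueError("direction is a single bit")
    if not 0 <= bearer < (1 << 5):
        raise ValueError("bearer is a 5-bit field")
    if not 0 <= length_bits < (1 << 16):
        raise ValueError("length must fit in bits [31:16]")
    # Bits [15:6] are reserved and left as zero.
    return direction | (bearer << 1) | (length_bits << 16)


f8_config = kasumi_config_word(direction=1, bearer=0x15)
f9_config = kasumi_config_word(direction=0, length_bits=256)
```

Here `f8_config` is 0x0000002B (direction in bit 0, bearer 0x15 in bits 5:1) and `f9_config` is 0x01000000 (a 256-bit message length in bits 31:16).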
12.4.3.2 SNOW-3G Engine
Note: The Extra Data fields used by SNOW-3G are described in detail in “SNOW3G Engines” on page 255.
Table 12-2. OPCODE Register and Extra Data Description

Bits      Name         Description
54:53     OPCODE       00  UEA2/128-EEA1 decrypt
                       01  UEA2/128-EEA1 encrypt
                       10  UIA2/128-EIA1 decrypt
                       11  UIA2/128-EIA1 encrypt

Extra Data – 5 Words

Word 0
63:0      iv           Initialization Vector. This field contains bits 63:0 of the 128-bit IV.
                       • For UEA2 and 128-EEA1 the IV is constructed as follows: {COUNT-C | BEARER | DIRECTION | 0^26}
                       • For UIA2 the IV is constructed as follows: {COUNT-I | FRESH}
                       • For 128-EIA1 the IV is constructed as follows: {COUNT-I | BEARER | 0^27}

Word 1
127:64    iv           Initialization Vector. This field contains bits 127:64 of the 128-bit IV.
                       • For UEA2 and 128-EEA1 the IV is constructed as follows: {COUNT-C | BEARER | DIRECTION | 0^26}
                       • For UIA2 the IV is constructed as follows: {COUNT-I | FRESH}
                       • For 128-EIA1 the IV is constructed as follows: {COUNT-I | BEARER | 0^27}

Word 2
63:0      key          SNOW Key. This field contains bits 63:0 of the 128-bit key for SNOW operations.

Word 3
127:64    key          SNOW Key. This field contains bits 127:64 of the 128-bit key for SNOW operations.

Word 4
15:0      length       Length Vector. This field is the number of bits for each new authentication
                       operation. The maximum value for this input is 2^16-64 (which is 65472).
                       Note: The maximum length value defined by [SNOW3G] is 20000 bits.
63:16     Reserved

Note: For all operations destination data size is the same as source data size.
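The IV layouts in Table 12-2 are concatenations of fixed-width fields. The sketch below builds the 64-bit values, under the assumptions that COUNT-C, COUNT-I, and FRESH are 32 bits wide, BEARER is 5 bits, DIRECTION is 1 bit, and the leftmost field occupies the most significant bits. The function names are hypothetical, and how the value is laid out in memory is not covered here.

```python
def uea2_iv_half(count_c, bearer, direction):
    """Build the 64-bit value {COUNT-C | BEARER | DIRECTION | 0^26}."""
    assert 0 <= count_c < (1 << 32)
    assert 0 <= bearer < (1 << 5)
    assert direction in (0, 1)
    # 32 + 5 + 1 + 26 zero bits = 64 bits.
    return (count_c << 32) | (bearer << 27) | (direction << 26)


def uia2_iv_half(count_i, fresh):
    """Build the 64-bit value {COUNT-I | FRESH}."""
    assert 0 <= count_i < (1 << 32)
    assert 0 <= fresh < (1 << 32)
    return (count_i << 32) | fresh


iv = uea2_iv_half(count_c=0x12345678, bearer=0x0A, direction=1)
```

For these inputs `iv` is 0x1234567854000000: the count fills the upper 32 bits, the bearer contributes 0x50000000, and the direction bit contributes 0x04000000.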
12.4.3.3 Usage Constraints for the Engine
This section describes both guidelines and constraints for using the Engine.
• As it applies to Extra Data (ED) alignment, in the MICA_CRYPTO_CTX_USER_OPCODE register,
ED size is the number of 8-byte words.
• If it is an odd number of 4-byte words, the CR is padded with zeros at the end to fill out the
remainder of the ED size.
12.4.4 Public Key Accelerator Engine
The Public Key Accelerator (PKA) is not accessed via MiCA Context Registers; instead it is
accessed via the Crypto Global MMIO space. The reason for this is that the attributes of the operations done by the PKA differ from those of the other engines. In general,
small operands (for example, ~0.5 kB – 2.5 kB) are operated on for a long time (for example, ~100k –
250k cycles). The time spent moving the operands and results is small relative to the computation time,
so the latency benefit of DMA over programmed I/O is minimal.
The Public Key Accelerator control and status registers, as documented in “Public Key Accelerator Engine” on page 195, occupy Engine Numbers 0x10 through 0x13. This effectively allows 20 bits of offset: 18
bits from the normal register offset (which is [17:0]), plus two bits from what is normally the
Engine Number ([19:18]).
At Engine Number 0x14, Offsets 0-64kB address the Tile-to-PKA Window RAM.
Offsets 64kB and above address the Tilera-specific CSRs.
“Host memory” as used here and in Tilera Public Key Accelerator documentation refers to the
Tile-to-PKA Window RAM, as shown in Figure 12-1. This RAM appears to Tiles as a block of
MMIO registers, and to the Public Key Accelerator as 64kB at addresses 0 to 0xFFFF.
The Public Key Accelerator also contains a high-performance True Random-Number Generator
(TRNG).
The PKA command interface uses descriptor/result rings held in Host memory space. Descriptors
do not contain any vector data – they contain pointers to vectors in Host memory space.
[Figure: block diagram. Mesh Interface (MMIO Registers and Context State: Context Registers, TLB, Global Registers, Context Specific State; Network Interfaces to and from Tiles: QDN for MMIO requests, RDN for MMIO read data, write acks, and IPI interrupts, SDN for memory read and write requests, RDN for memory read data and write acks; PA to Route Header Generation), Engine Scheduler, Engine Front End (Egress DMA from memory to MiCA, Ingress DMA from MiCA to memory, Command Queues, Response Queues), IPI Generator with Interrupt Bindings, and Function-Specific Engines (Public Key Accelerator with the Tile-to-PKA Window RAM holding operands and results).]
Figure 12-1: Public Key Accelerator in Crypto MiCA
12.4.4.1 Descriptor Ring Management
Descriptors are 32 bytes in size. Up to four separate command descriptor rings can be used, each
accompanied by a result descriptor ring of the same size. Command and result descriptor rings
can be co-located or placed at different (non-overlapping) locations in Host space. The number of
descriptors on each ring is configurable from 1 to 64k in the Public Key Accelerator RING_SIZE_n
registers (but because the rings must be allocated in the PKA Window RAM, the actual maximum
size is limited).
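Because the rings live in the Tile-to-PKA Window RAM, their footprint is worth working out before choosing a ring size. The sketch below uses the 32-byte descriptor size and the 64kB window size given in the text; the function name and the 20-entry ring are example choices only.

```python
DESCRIPTOR_BYTES = 32            # descriptor size from the text
WINDOW_RAM_BYTES = 64 * 1024     # Tile-to-PKA Window RAM size


def ring_footprint(entries, rings=1, colocated=False):
    """Bytes of Window RAM used by the command and result rings."""
    per_ring = entries * DESCRIPTOR_BYTES
    # Co-located command and result rings share the same storage.
    factor = 1 if colocated else 2
    return rings * per_ring * factor


separate = ring_footprint(20)
shared = ring_footprint(20, colocated=True)
left_for_operands = WINDOW_RAM_BYTES - separate
```

A 20-entry ring pair takes 1280 bytes when the command and result rings are separate and 640 bytes when they are co-located, leaving 64256 bytes of the window for operands in the separate case.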
If multiple rings are used, rotating priority will normally be used to select which ring is to supply
the next PKA command to execute. It is possible to place Ring 0 at a higher priority than the
remaining rings. It is also possible to turn the rotating priority off (in which case Ring 0 gets the
lowest priority and Ring 3 the highest priority).
We recommend using separate rings if large differences in execution times for commands are
expected. This prevents results for short-execution-time commands from being
stalled behind the result of one long-execution-time command. The reason for this is that most of the
internal buffer RAM is used to buffer command descriptors from each command ring – no new
commands can be loaded when the oldest command in this buffer has not yet completed.
Read/write pointers for the rings should be kept locally by the Host and the PKA master controller (the latter will use some words of buffer RAM to hold them, providing progress indication and
a re-sync capability). No true ‘ownership’ bits are used in the descriptors; these are not necessary
because the command counters can be used to determine if new commands can be written. Result
descriptors contain two ‘written zero’ bits that can be used (by a driver) for ownership indications, but are mainly intended to prevent result interrupt race problems.
12.4.4.2 Command Descriptor Contents
Command descriptors are 32 bytes (8 words of 32 bits) long. The presence of new command
descriptors in a ring must be indicated by the Host through incrementing of the command
counter associated with that ring (after the descriptor contents have been written). It is possible to
link command descriptors so that one command can only be executed when the previous linked
command has been executed – these linked descriptors must be transferred into the ring as a
whole (that is, the command counter must be incremented by the number of linked commands).
The descriptor contains pointers to operands in Host memory, a command field indicating which
command to execute and some other miscellaneous information.
12.4.4.3 Result Descriptor Contents
After finishing a command, the PKA master controller will convert the original command
descriptor into a result descriptor and write this descriptor to the result ring. The result descriptor
contains status information about the operation.
12.4.4.4 Interrupts
Public Key Accelerator generates the following interrupts.
• Command queue empty, four interrupts, one per queue. The threshold is configurable in the Public
Key Accelerator MICA_CRYPTO_ENG_IRQ_THRESH_n register. The interrupt should only be
dismissed (by writing to the Public Key Accelerator MICA_CRYPTO_ENG_AIC_ACK register) when
new commands are added to the queue.
• Result queue full, four interrupts, one per queue. The threshold is configurable in the Public
Key Accelerator MICA_CRYPTO_ENG_IRQ_THRESH_n register. Also a timeout value for interrupting on non-empty result queue is configurable in the same register. Dismissing the interrupt is done by writing to the Public Key Accelerator MICA_CRYPTO_ENG_RESULT_COUNT_n
register (to indicate that results have been processed) followed by writing the Public Key Accelerator MICA_CRYPTO_ENG_AIC_ACK register (to acknowledge the interrupt). Note that the writes
must be done in this order.
• PKA Master interrupt.
Interrupt generated directly by the PKA master controller to request attention from the Host
(intended for signaling errors and/or completed commands).
• TRNG interrupt.
Interrupt generated directly by the PKA master controller to request attention from the Host
(intended for signaling errors and/or completed commands).
Each interrupt has an associated Tile Interrupt Binding MICA_CRYPTO_ENG_INT_BINDING_PKA_QUEUE_n_EMPTY register. When the Public Key Accelerator generates an interrupt it is
marked as pending in the IPI Generator logic, which then arbitrates to send an IPI to the bound
tile. When the IPI is sent, the interrupt is marked as not pending. The implication of this is that
multiple Public Key Accelerator interrupts might coalesce into a single IPI, depending on how
quickly they occur.
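The coalescing behavior can be seen in a very small model of one interrupt binding. The class and method names are invented for the example, and arbitration among several bindings is reduced to a single explicit call.

```python
class IpiGenerator:
    """Toy model: one pending flag per binding, cleared when the IPI is sent."""

    def __init__(self):
        self.pending = False
        self.ipis_sent = 0

    def interrupt(self):
        # A second event before the IPI is sent finds the flag already set.
        self.pending = True

    def arbitrate(self):
        if self.pending:
            self.ipis_sent += 1
            self.pending = False


gen = IpiGenerator()
gen.interrupt()
gen.interrupt()    # arrives before the first IPI wins arbitration
gen.arbitrate()
gen.interrupt()
gen.arbitrate()
```

Three interrupt events produce two IPIs here, so a handler written against this behavior would check the queue state on each IPI rather than count IPIs.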
Interrupt Latency
In order to keep the Public Key Accelerator fully loaded, the driver must keep its input queue(s)
full enough so that there is a new command ready when a farm engine is available. A rough
example of interrupt service latency is provided below.
Assumptions:
1. Crypto frequency – 800 MHz.
2. Core frequency – 800 MHz. Note that higher core frequency relative to Crypto frequency will
give a longer latency period, since a given PKA operation is measured in cycles of Crypto
frequency.
3. A minimum PKA operation – 80k cycles.
4. Only one Command / Result queue configured.
5. Command / Result queue size – 20 entries. This uses 1280 Bytes in the PKA Window RAM if
the Command and Result queue do not overlap, and 640 Bytes if they do. There is enough
space for maximum size operands (4 kbit) to be statically allocated.
6. Result queue threshold – 10 entries. This is chosen to moderate the number of interrupts to
one every 10 operations.
With these specifications, there will be 10 entries on the command queue when the interrupt is
triggered. The command queue must be refilled before those operations are completed, which
takes a minimum of 800k cycles. That time can be increased by either having a smaller threshold
(more interrupts), or by having more queue entries, which would require dynamically allocating
space for operands.
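The service budget in this example can be reproduced with a few lines of arithmetic. The constants come from the assumptions listed above; the conversion to seconds is an added step and assumes the remaining commands execute back to back at the minimum operation time.

```python
CRYPTO_HZ = 800e6            # assumption 1: Crypto frequency
MIN_OP_CYCLES = 80_000       # assumption 3: minimum PKA operation
QUEUE_ENTRIES = 20           # assumption 5: queue size
RESULT_THRESHOLD = 10        # assumption 6: result queue threshold

# Commands still queued when the result-threshold interrupt fires.
commands_left = QUEUE_ENTRIES - RESULT_THRESHOLD
budget_cycles = commands_left * MIN_OP_CYCLES
budget_seconds = budget_cycles / CRYPTO_HZ
```

This gives 800,000 Crypto cycles, or about 1 ms at 800 MHz, for the driver to take the interrupt and refill the command queue. Lowering the threshold to 5 would leave 15 commands queued and raise the budget to 1.2M cycles.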
CHAPTER 13
COMPRESSION ACCELERATOR INTERFACE
13.1 Overview
The software interface uses the MiCA™ standard interface as described in Chapter 11: Common
Accelerator Interface (MiCA). This chapter describes those features unique to the compression
functionality.
The TILE-Gx™ architecture supports hardware acceleration for lossless compression and decompression operations. The Raw DEFLATE format (RFC1951) and GZIP file format (RFC1952) are
supported by the hardware accelerators.
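The difference between the two formats can be demonstrated in software with Python's standard `zlib` module. This is not the accelerator API; it only shows that a GZIP stream is raw DEFLATE data wrapped in a header and trailer, which is useful when checking accelerator output against a software reference.

```python
import gzip
import zlib

data = b"TILE-Gx " * 64

# Raw DEFLATE (RFC 1951): a negative wbits value means no header or trailer.
raw = zlib.compressobj(wbits=-15)
raw_stream = raw.compress(data) + raw.flush()

# GZIP (RFC 1952): wbits of 16 + 15 adds a gzip header and a
# CRC-32 / length trailer around the DEFLATE data.
gz = zlib.compressobj(wbits=31)
gz_stream = gz.compress(data) + gz.flush()

raw_back = zlib.decompress(raw_stream, wbits=-15)
gz_back = gzip.decompress(gz_stream)
```

Both streams decompress to the original data. The GZIP stream starts with the magic bytes 0x1f 0x8b and is longer than the raw stream because of the added header and trailer.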
The TILE-Gx36 implementation supports full-duplex, 10 Gb/s compression and 10 Gb/s decompression. The TILE-Gx36 implementation supports a common Multicore iMesh Coprocessing
Accelerator (MiCA) front-end architecture. The TILE-Gx36 implementation provides two TILE-Gx
ZIP controllers. Each controller has one compression engine and two decompression engines.
Each controller supports forty Contexts.
The following terms are used in this chapter:
• Transaction. The operation associated with one opcode MMIO write.
• Context. Hardware entity that contains information (interrupt bindings, state, source and destination data pointers, opcodes, and other engine-specific information) for processing data.
• DEFLATE. A lossless data compression algorithm that uses a combination of the LZ77 algorithm and Huffman coding.
13.2 Data Flows
13.2.1 Typical Compression Flow
The API configures one of the Contexts via MMIO packets over the QDN network. The API specifies the opcode, transfer size, source data descriptor, and the destination data descriptor. Once a
Context is programmed, it generates a pending compression transaction to be processed. The
GZIP controller selects the next compression transaction, if there is an available engine to process
the transaction. New Contexts are scheduled on a round-robin basis. The selected Context is considered “active”.
The eDMA engine first performs a TLB lookup, and then decodes the source data descriptor. The
engine then fetches the uncompressed source data from the memory over the SDN network, cache
line-by-cache line. If the source data is not aligned with the cache line boundary, the eDMA
engine discards the unused bytes from that line. The source data might not be returned in the
order that it was requested. The eDMA engine assembles the returned source data from the RDN
network. The eDMA engine uses its own network buffer to smooth out the latency jitter from the
memory space. If the data comes from the cache, the latency tends to be lower than if it comes
from external DRAM.
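The cost of unaligned source data can be estimated with a short sketch. It counts the cache lines the eDMA engine would have to fetch for a buffer and the bytes it would discard. The 64-byte line size and the function name are assumptions made for illustration; they are not taken from this chapter.

```python
CACHE_LINE_BYTES = 64  # assumed line size, not stated in this chapter

def cache_line_fetches(src_addr, length, line=CACHE_LINE_BYTES):
    """Return (lines_fetched, bytes_discarded) for one source buffer."""
    if length <= 0:
        return 0, 0
    first_line = src_addr // line                 # index of the first line touched
    last_line = (src_addr + length - 1) // line   # index of the last line touched
    lines = last_line - first_line + 1
    return lines, lines * line - length

print(cache_line_fetches(0x1000, 256))  # aligned buffer: (4, 0)
print(cache_line_fetches(0x103C, 8))    # 8 bytes straddling a boundary: (2, 120)
```

Under this assumption, a small buffer that straddles a line boundary costs two full line fetches, which is one reason to align source buffers where possible.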
The compression engine performs the data compression. The raw DEFLATE format (RFC1951) and
GZIP format (RFC1952) are supported.
The iDMA engine assembles the compressed data generated from the compression engine. Then it
sends the compressed data, cacheline-by-cacheline, back to memory over the SDN network. The
iDMA engine uses its own network buffer to smooth out the latency jitter to the memory space. A
masked write will be used for transfers smaller than one cache line; pad writes are not supported.
After the iDMA engine sends the last piece of the compressed data to the memory space, it waits
for an acknowledgement that all writes to memory are visible. Then a completion interrupt is dispatched over the RDN network. The API then checks the completion status via an MMIO packet.
This packet includes an indication of successful completion or an error status. It also indicates the
number of bytes of compressed data.
A new transaction operation (from a different Context) can be scheduled to the same compression
engine once the last flit of the data from the previous transaction is on the way to the mesh network.
13.2.2 Typical Decompression Flow
The decompression flow is very similar to the compression flow: the API configures a Context, the
eDMA engine fetches the source data, the decompression engine performs the decompression,
and the iDMA engine sends the decompressed data back to memory. A completion interrupt is dispatched after the writes are visible in the memory space.
The iDMA engine for decompression handles a higher output data rate than the one for compression, as the uncompressed data is typically larger than the compressed data.
13.3 Compression Engine
13.3.1 Engine Configuration
The compression engine can be customized for four different input patterns. A better compression ratio typically requires more processing time, and the pattern you choose determines how that trade-off is made.
Users must weigh these considerations when deciding
how to configure the engine for a given application. Users can specify the following parameters:
• NICE_MATCH
• GOOD_MATCH
• MAX_CHAIN_LENGTH
• MAX_DIST
• MATCH0_RATIO
• MATCH1_RATIO
• DTREE
• BINARY_HINT
• SMALL_PACKET_HINT
• HIGH_COMPRESSION_HINT
Refer to the compression engine register definitions in Section 13.6.2, “Compression/Decompression Engine
Registers,” for more details.
13.3.2 GZIP Handling
The hardware engine can be configured to generate compressed data in GZIP format. The configuration can be set using the opcode flag or the MICA_COMP_ENG_DEFL_REG_DEF_CTL register.
The GZIP header is always generated as follows:
Table 13-1. GZIP Header Format
Structure   Value
ID1         0x1F
ID2         0x8B
CM          0x8
FLG         0x0
MTIME       0x0
XFL         0x4
OS          0x3 (UNIX)
The GZIP trailer contains two fields: a CRC32 (ISO 3309) for integrity checking and the input size
of the original (uncompressed) data.
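As a cross-check of Table 13-1, the following sketch assembles a complete GZIP member in software: the fixed 10-byte header, a raw DEFLATE payload, and the CRC32/ISIZE trailer. The field widths and little-endian byte order come from RFC 1952. The function names are hypothetical, and zlib stands in for the hardware compression engine.

```python
import struct
import zlib

def software_raw_deflate(data: bytes) -> bytes:
    """Software stand-in for the compression engine (an assumption)."""
    comp = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()   # negative wbits = raw DEFLATE

def gzip_member(raw_deflate: bytes, original: bytes) -> bytes:
    """Wrap a raw DEFLATE stream in the Table 13-1 header plus the trailer."""
    # ID1, ID2, CM, FLG, MTIME (4 bytes, little-endian), XFL, OS
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 0x08, 0x00, 0, 0x04, 0x03)
    # CRC32 of the uncompressed data, then its size modulo 2**32 (ISIZE)
    trailer = struct.pack("<II", zlib.crc32(original) & 0xFFFFFFFF,
                          len(original) & 0xFFFFFFFF)
    return header + raw_deflate + trailer

data = b"TILE-Gx " * 64
member = gzip_member(software_raw_deflate(data), data)
# wbits = 16 + MAX_WBITS makes zlib expect a GZIP wrapper and verify the trailer
assert zlib.decompress(member, 16 + zlib.MAX_WBITS) == data
```

Because FLG is 0x0, the header carries no optional fields, so it is always exactly 10 bytes long.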
13.4 Decompression Engine
13.4.1 GZIP Handling
The hardware engine automatically detects the format of the compressed data, either raw
DEFLATE format or GZIP format.
If the GZIP format is detected, the engine parses the header fields, including all optional fields,
such as a header CRC16, extra fields, the file name, and comments. For header integrity, the
extracted header fields are used to check against the optional CRC16. The context status is
updated accordingly in case of an error. The engine does not forward the extracted header fields
to the user space, as the header fields are in readable format and can be parsed by the user application. For payload integrity, the uncompressed data is checked against the CRC32. The size of
the uncompressed data is checked against the input size (ISIZE) field that is part of the GZIP trailer.
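The checks described above can be mirrored in software, for example to validate a buffer before submitting it or to interpret an exception status afterwards. This is a minimal sketch based on the GZIP layout in RFC 1952, not a description of the hardware implementation; the function names are hypothetical.

```python
import struct
import zlib

FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10  # FLG bits (RFC 1952)

def gzip_header_length(buf: bytes) -> int:
    """Return the header length in bytes, verifying the optional CRC16."""
    if (buf[0], buf[1], buf[2]) != (0x1F, 0x8B, 0x08):
        raise ValueError("not a GZIP member with CM = 8 (DEFLATE)")
    flg = buf[3]
    pos = 10                                  # fixed part: ID1 through OS
    if flg & FEXTRA:
        (xlen,) = struct.unpack_from("<H", buf, pos)
        pos += 2 + xlen                       # XLEN field plus extra data
    if flg & FNAME:
        pos = buf.index(b"\x00", pos) + 1     # zero-terminated file name
    if flg & FCOMMENT:
        pos = buf.index(b"\x00", pos) + 1     # zero-terminated comment
    if flg & FHCRC:
        (crc16,) = struct.unpack_from("<H", buf, pos)
        # CRC16 is the low 16 bits of the CRC32 over the preceding header bytes
        if crc16 != (zlib.crc32(buf[:pos]) & 0xFFFF):
            raise ValueError("header CRC16 mismatch")
        pos += 2
    return pos

def check_trailer(member: bytes, payload: bytes) -> None:
    """Check decompressed data against the CRC32 and ISIZE trailer fields."""
    crc32, isize = struct.unpack("<II", member[-8:])
    if crc32 != (zlib.crc32(payload) & 0xFFFFFFFF):
        raise ValueError("payload CRC32 mismatch")
    if isize != (len(payload) & 0xFFFFFFFF):
        raise ValueError("ISIZE mismatch")
```

The three failure paths correspond in spirit to the CRC16, CRC32, and ISIZE exception conditions that the decompression engine reports.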
13.5 Memory-to-Memory Copy
Memory-to-Memory copy operations are specified by Engine Types 0 and 1 in the MICA_COMP_CTX_USER_OPCODE register. Type 0 does a copy of source data to destination; Type 1 does a
copy of inverted source data to destination. No extra data is used for memory-to-memory copies.
Both the compression engine and the decompression engines can perform memory-to-memory
copy operations. At run time, either the compression engine or the decompression engine (but not both) should
be configured to support memory-to-memory copy operations.
13.6 API
The compression controller supports a common MiCA front-end architecture.
13.6.1 Context Registers
Each GZIP controller supports 40 Contexts. Each context has two distinct sets of registers, which
allows for assigning different protection levels, typically for User and System space. The User
space registers typically include the OPCODE register, source data descriptor register (MICA_COMP_CTX_USER_SRC_DATA), destination descriptor register
(MICA_COMP_CTX_USER_CONTEXT_STATUS), and operation length register (MICA_COMP_CTX_USER_OPCODE). The System space registers typically include the interrupt binding registers
(MICA_COMP_CTX_SYS_COMP_INT), TLB handling registers (MICA_COMP_CTX_SYS_TLB_MISS_INT, for example), and system status registers
(MICA_COMP_CTX_USER_CONTEXT_STATUS), etc. Refer to Chapter 11: Common Accelerator
Interface (MiCA) for more details.
13.6.2 Compression/Decompression Engine Registers
Each compression engine and decompression engine has its own set of registers, including the
configuration registers and performance counters.
OPCODE
The GZIP controller supports the following ENGINE_TYPEs in the MICA_COMP_CTX_USER_OPCODE register.
Table 13-2. ENGINE_TYPE Register Description
ENGINE_TYPE   Description
000           Memory copy
001           Memory copy with data inversion
010           Compression
011           Decompression
1xx           Reserved
Each MICA_COMP_CTX_USER_OPCODE register has an engine-specific field to specify a flag associated with the ENGINE_TYPE.
The compression engine supports the flags listed in Table 13-3. Refer to the Deflate Engine Configure Control register (MICA_COMP_ENG_DEFL_DEF_CTL) for more details on each field.
Table 13-3. Compression Engine Flags
Bits   Name
9      CHINT
8      SHINT
7      BHINT
6      DTREE
5:2    MAX_CHAIN_LENGTH
1      FORMAT
0      CONFIG
No “Flag” bits are defined for the decompression engine.
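The bit positions in Table 13-3 can be packed into a flag value as sketched below. Only the bit positions come from the table. The helper name is hypothetical, and neither the meaning of each field value nor the position of the flag field within the MICA_COMP_CTX_USER_OPCODE register is given here, so consult the register definitions before using such a value.

```python
# (least significant bit, width) of each flag, taken from Table 13-3
FLAG_FIELDS = {
    "CONFIG": (0, 1),
    "FORMAT": (1, 1),
    "MAX_CHAIN_LENGTH": (2, 4),
    "DTREE": (6, 1),
    "BHINT": (7, 1),
    "SHINT": (8, 1),
    "CHINT": (9, 1),
}

def pack_compression_flags(**fields):
    """Pack named flag values into the 10-bit compression flag field."""
    value = 0
    for name, field_value in fields.items():
        lsb, width = FLAG_FIELDS[name]          # KeyError for unknown names
        if not 0 <= field_value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bit(s)")
        value |= field_value << lsb
    return value

flags = pack_compression_flags(FORMAT=1, MAX_CHAIN_LENGTH=0b1010)
print(hex(flags))  # 0x2a
```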
13.6.3 Status Registers
Once a compression transaction or decompression transaction is completed, the per-context
MICA_COMP_CTX_USER_CONTEXT_STATUS register provides the status of the operation: the number of bytes of the last transaction result, and the exception status, if any.
Decompression Exception Status
Table 13-4. Decompression Exception Status Register Description
Exception           Description
STS_BTYPE_ERR       BTYPE specifies an unsupported block type.
STS_NLEN_ERR        The number of data bytes in a non-compressed block is corrupted.
STS_SYMBOL_ERR      An unrecognized Huffman symbol is detected.
STS_GZ_CRC16_ERR    The GZIP header contains an incorrect CRC16.
STS_GZ_CRC32_ERR    The GZIP payload contains an incorrect CRC32.
STS_GZ_ISIZE_ERR    The number of uncompressed GZIP data bytes does not match the size of the
                    original raw (uncompressed) data.
13.6.4 Transaction Size
The following items are characteristics of the compression/decompression engines:
• Different size. The compression engine and decompression engine typically produce output that differs in size from the input. The output is usually smaller than the input
for compression, and usually larger than the input for decompression.
• Large size. The maximum size of the compressed or uncompressed buffer is 64 MB in each
transaction. The transactions are stateless; that is, a flush is performed between transactions.
• Small size. Although the compression engine supports small packet sizes, small packets
result in non-optimal performance. A threshold can be set so that smaller packets are not
compressed and are copied verbatim instead. For example, following the implementation described in RFC 2394, users should not attempt to compress buffers smaller than 90 bytes.
13.6.5 Data Expansion Handling
In some circumstances, the output data from a compression operation will be larger than the
input data. This condition might occur for a random data stream (for example, a previously compressed or encrypted file), or for extremely small packet sizes. In that case, do not use
the expanded compressed data; copy the original data verbatim instead. It is up to the user
application to choose either the uncompressed data or compressed data, based on the MICA_COMP_CTX_USER_CONTEXT_STATUS register. The source data buffer that is holding the
uncompressed data should be released only after the compression process is completed.
If the user application chooses to copy the data uncompressed, a 5-byte DEFLATE header must be
prepended: one byte of 0x1, two bytes of LEN to indicate the number of bytes of uncompressed
data, and two bytes containing the one’s-complement of the LEN bitfield. Segmentation is necessary if the uncompressed data is larger than 65,535 bytes, as the LEN bitfield has only two bytes.
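The uncompressed fallback can be sketched as follows. Each segment is written as a DEFLATE stored block (RFC 1951): a header byte, LEN, the one's-complement of LEN, then the data. Per RFC 1951, the header byte is 0x1 only on the final block (BFINAL set) and 0x0 on earlier segments. The function name is hypothetical, and zlib is used only to confirm that the result is a valid raw DEFLATE stream.

```python
import struct
import zlib

MAX_STORED = 0xFFFF  # largest value that fits in the two-byte LEN field

def stored_deflate(data: bytes) -> bytes:
    """Wrap uncompressed data in DEFLATE stored (BTYPE = 00) blocks."""
    chunks = [data[i:i + MAX_STORED]
              for i in range(0, len(data), MAX_STORED)] or [b""]
    out = bytearray()
    for index, chunk in enumerate(chunks):
        final = 1 if index == len(chunks) - 1 else 0   # BFINAL on last block only
        # header byte, LEN, and one's-complement of LEN (both little-endian)
        out += struct.pack("<BHH", final, len(chunk), len(chunk) ^ 0xFFFF)
        out += chunk
    return bytes(out)

payload = bytes(range(256)) * 300            # 76,800 bytes, needs two segments
stream = stored_deflate(payload)
assert zlib.decompress(stream, -15) == payload
assert len(stream) == len(payload) + 2 * 5   # 5 bytes of overhead per segment
```

The overhead is therefore 5 bytes per 65,535 bytes of data, which bounds the worst-case expansion when the application falls back to stored blocks.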
13.6.6 Performance Counter
Various performance counting events can be selected. For details, refer to the MICA_COMP_ENG_DEFL_PERF_CTL_0 and MICA_DEFL_PERF_CTL_1 registers.
CHAPTER 14 FLEXIBLE I/O INTERFACE
14.1 Overview
The flexible I/O interface configures and formats data for up to 64 data pins. These data pins can
be used to implement low-speed status and control bits, or to implement moderate speed asynchronous interface protocols such as HPI, ATA, etc. Each I/O pin can be individually configured
to be an input, output, or bidirectional pin with a number of drive and input options. The interface supports simultaneous use by multiple processes with full protection and virtualization
support. The virtualization support allows direct access to the interface by application-level programs with full process isolation and protection. MMIO transfers are used to configure the
interface and to supply and receive data from the I/O pins. An interrupt capability is supplied to
allow interrupts to be generated on any transition of a pin. The high-level structure of the interface is as shown in Figure 14-1.
[Block diagram: MMIO Registers, Pin Format, and I/O Pads blocks; signals RD DATA, WR DATA, CMD, INT, PROT MSK, OE, DOUT, and DIN; 64-bit data paths to the 64 pins.]
Figure 14-1: Flexible I/O Interface
14.2 Virtualization and Protection Support
The flexible I/O interface supports virtualization and protection using two related mechanisms.
The register set of the interface is duplicated at eight different addresses, each set representing a
service domain. Associated with each service domain is a programmable privilege level and a pin
access mask. An MMIO register access is valid only if the privilege level of the service domain that
is being accessed is greater than or equal to the required privilege level of the register.
If the register access is valid and is to a configuration register, then the access is performed. If the
access is invalid, an error is logged and the action is inhibited. Invalid MMIO writes are ignored,
and invalid MMIO reads return all zeros.
If the register access is valid and is to a pin manipulation register or interrupt vector register, then
the access is further filtered by the service domain pin access mask. For an MMIO write,
a register bit is updated only if the corresponding bit is set in the pin access mask. If the pin
access mask is clear for a given bit, then the update of that bit is ignored. For MMIO reads, the
data is read for bits that have the pin access mask set, and 0 is read for bits that have the pin access
mask clear.
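The pin access mask filtering described above amounts to two bitwise operations, sketched here as a software model. The function names are hypothetical; the model covers only the masking step, not the privilege-level check that precedes it.

```python
PIN_BITS = 64
ALL_PINS = (1 << PIN_BITS) - 1

def masked_write(current, data, pin_access_mask):
    """Update only the register bits whose pin access mask bit is set."""
    kept = current & ~pin_access_mask      # bits this service domain cannot touch
    updated = data & pin_access_mask       # bits this service domain may change
    return (kept | updated) & ALL_PINS

def masked_read(current, pin_access_mask):
    """Return register bits for permitted pins and 0 for all other pins."""
    return current & pin_access_mask

state = masked_write(0xF0, 0x0F, pin_access_mask=0x3C)
print(hex(state))                     # 0xcc
print(hex(masked_read(state, 0x3C)))  # 0xc
```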
14.3 MMIO Register Map
The flexible I/O interface register map contains eight identical copies of the MMIO registers. Each
copy corresponds to a different service domain as specified by the SVC_DOM field in the MMIO
address. The service domain attributes are used to control register protection and filtering of the
MMIO accesses. The registers that provide access to data for multiple I/O pins are further masked
by the service domain pin access mask. These registers are arranged into two groups: Pin access
registers (GPIO_PIN_STATE through GPIO_PIN_INPUT_CND) and Interrupt control registers
(GPIO_INT_VEC0_W1TC through GPIO_INT_VEC1_RTC). The full map is shown in the
GPIO_HTML.
14.4 Interrupts
The flexible I/O interface supports interrupts on the rising or falling transition of each I/O pin.
The interrupts are specified on a per-pin basis, and can be vectored to any tile event and interrupt
vector number. The binding of the per-pin interrupts is specified by the GPIO_INT_BIND register. Interrupts can be dismissed using either MMIO read or MMIO write accesses, as the
application requires. The GPIO_INT_VEC0_W1TC and GPIO_INT_VEC1_W1TC registers allow
the interrupt state to be read and only a subset of the pending interrupts to be dismissed, while reading the GPIO_INT_VEC0_RTC or GPIO_INT_VEC1_RTC registers returns the currently pending
interrupts and clears all of them. In the latter case, the application must process all of the returned interrupts,
because none are left pending.
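The difference between the two dismissal styles can be seen in a small software model. It assumes, as the register name suffixes suggest, that W1TC means "write 1 to clear" and RTC means "read to clear"; the class and method names are hypothetical.

```python
class InterruptVectorModel:
    """Software model of one interrupt vector with W1TC and RTC views."""

    def __init__(self):
        self.pending = 0

    def raise_pins(self, bits):
        self.pending |= bits          # pin transitions set pending bits

    def read_w1tc(self):
        return self.pending           # reading leaves the state unchanged

    def write_w1tc(self, bits):
        self.pending &= ~bits         # dismiss only the bits written as 1

    def read_rtc(self):
        value, self.pending = self.pending, 0   # return and clear everything
        return value

vec = InterruptVectorModel()
vec.raise_pins(0b1011)
vec.write_w1tc(0b0010)                # handle pin 1 now, leave the others pending
remaining = vec.read_rtc()            # all of these must be handled by the caller
assert remaining == 0b1001 and vec.pending == 0
```

The W1TC style suits handlers that service one pin at a time; the RTC style saves an MMIO write but requires the handler to process every bit it reads.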
14.5 I/O Pin Driver Configuration
Each pin can be individually configured to be an input, output, or bidirectional pin with or without internal pullup/pulldown resistors. The output pins can support either normal or open-drain
drivers and support output drives from 4 mA to 12 mA with optional slew-rate control. The input
pins can be configured to be normal inputs or Schmitt trigger inputs.
14.6 I/O Pin Clocking Control
All input pins are sampled, and all output pins and enables are applied synchronously to an internally generated clock named GCLK. The interface configures the source and frequency of GCLK in
the GPIO_GCLK_MODE register. The interface will guarantee that two sequential register accesses
will be performed on different GCLK edges. The GCLK can be generated from two different clock
sources: CCLK divided by 2 and CORE_REF_CLK, and can be further divided by the DIVIDE
parameter of the GPIO_GCLK_MODE register.
Changing the GCLK period permits an application that is supplying data and a strobe to guarantee
at least one GCLK period of setup and hold for the pin accesses of a data bus and a data strobe. For
example, a write of a set of data bits followed by a write asserting the strobe will guarantee that
the data bits are stable for one GCLK cycle before the strobe occurs. This also guarantees that the
next write to the data will not occur until one GCLK cycle after the assertion of the strobe. Additionally on the input side, the application can read the pin state waiting for the assertion of the
strobe, and can perform a separate read of the data bus if the external device does not supply sufficient setup time. This access pattern guarantees that the data bus will be sampled at least one
GCLK period after the appearance of the strobe.
One issue with a slow GCLK period is that there is limited buffering for MMIO transactions in the
flexible I/O interface itself; exceeding this limit can cause performance issues with other transactions on the iMesh. The flexible I/O interface can buffer eight MMIO transactions, so
applications that use a slow GCLK frequency and need to send more than eight MMIO transactions without waiting for results must employ some kind of flow control. Flow control can
be achieved by inserting an MF instruction every few MMIO accesses to the flexible I/O interface, which causes the processor to wait until the previous MMIO accesses have completed
before issuing further MMIO accesses. Alternatively, the program can perform an MMIO read
request and use the returned data, which implies that all previous MMIO accesses
have completed. This restriction is not severe, because most applications do not require slow
GCLK frequencies, and would normally perform MMIO read accesses before sequential MMIO write
accesses.
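The flow-control pattern can be sketched as a loop that never leaves more than eight writes outstanding. This is a model only: `mmio_write` and `fence` are hypothetical callbacks standing in for an MMIO store and the MF instruction, and real code would be written in C or assembly for the tile.

```python
MMIO_BUFFER_DEPTH = 8   # MMIO transactions the flexible I/O interface can buffer

def write_burst(values, mmio_write, fence, depth=MMIO_BUFFER_DEPTH):
    """Issue writes, fencing so that at most `depth` are ever outstanding."""
    outstanding = 0
    for value in values:
        if outstanding == depth:
            fence()               # stands in for MF: wait for earlier writes
            outstanding = 0
        mmio_write(value)
        outstanding += 1

log = []
write_burst(range(20),
            mmio_write=lambda v: log.append(("write", v)),
            fence=lambda: log.append(("mf",)))
assert log.count(("mf",)) == 2    # fences before the 9th and 17th writes
```

Fencing every `depth` writes is the loosest schedule that respects the buffer limit; fencing more often only adds latency.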
14.7 Pin Control and Data Accesses
The direction of the I/O pin interface can be configured by programming the GPIO_PIN_DIR_I
and GPIO_PIN_DIR_O registers. The supported modes are input-only, output, and open drain.
Pins in the output mode can be used as bidirectional pins by disabling (releasing) the driver without changing the mode. Input pins can select either a normal receiver or the Schmitt trigger input
in the GPIO_PAD_CONTROL register.
The input state of any pin can be ascertained by reading the GPIO_PIN_STATE register. The value
read can be conditionally inverted to support active-low signals by specifying the contents of
GPIO_PIN_INPUT_INV. The GPIO_PIN_INPUT_SYNC register can also configure an input to have additional two-level synchronization to avoid noise sampling problems.
The output value of a pin can be specified through a number of registers, based on the application
requirements. The output value can be conditionally inverted by programming the GPIO_PIN_OUTPUT_INV register. The output driver can be disabled so that the pin can be used as an
input on a bidirectional bus by writing the GPIO_PIN_RELEASE register. Output pulses with a
duration of a single GCLK cycle can be created by writing the GPIO_PIN_PULSE_SET and GPIO_PIN_PULSE_CLR registers. Normal output data is written by using the GPIO_PIN_STATE
register, but a few special operations are implemented using additional data registers. These
additional operations permit clearing, setting, or toggling the output register. See the GPIO_PIN_xxx register documentation for details, starting at GPIO_PIN_STATE and ending at
GPIO_PIN_INPUT_CND.
14.8 Reset/Initialization
At reset, all I/O pins are configured as input pins with the output driver disabled, and the GCLK
frequency is reset to the CORE_REF_CLK frequency.
14.9 Performance
The maximum rate of data transitions of the I/O pins is limited by the GCLK frequency. The maximum frequency of GCLK is the higher of the maximum CORE_REF_CLK frequency and half the
core clock frequency.
CHAPTER 15 RSHIM INTERFACES
The Rshim contains chip-level services for boot and debug. It also hosts a number of the low
speed interfaces such as UARTs, I2C-Masters, I2C-Slave, and SPI. The Rshim provides a register
interface that is accessible from the locally-hosted devices, Tile software, and remote devices such
as PCIe, JTAG, and USB.
Thus the generic boot and debug services hosted by the Rshim are available both for Tile software
and for external agents connecting through USB, PCIe, UART, or I2C.
The sections below describe a number of the Rshim’s services. Additional services are described
within the Rshim register documentation.
15.1 Level-1 Boot
Level-1 boot is achieved by sending boot data on the UDN to a target Tile. The Rshim provides
access to the UDN via the packet generator (RSH_PG_CTL, RSH_PG_DATA registers). Thus any
device that has access to Rshim registers can provide the level-1 boot stream to the Tiles.
The packet generator is reset to a state that directly sends writes to the RSH_PG_DATA register
onto the UDN. A 4KB boot buffer provides elasticity to the boot stream. An external agent providing the boot stream can read the SENT_COUNT field in the RSH_PG_CTL register to determine how
much boot data has been sent to the Tile and hence how much space is available in the 4KB boot
buffer.
By preventing the boot buffer from filling, the external agent can ensure that accesses to other
Rshim registers will never be blocked and hence even a wedged level-1 boot process can be
debugged via access to other Rshim registers (RSH_JTAG_CONTROL or RSH_RESET_CONTROL for
example).
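The boot-buffer flow control can be sketched as a loop that writes a word only when the buffer has room. The sketch makes several assumptions that are not stated in this section: that SENT_COUNT counts bytes, that it does not wrap during the transfer, and that each RSH_PG_DATA write carries 8 bytes. `write_pg_data` and `read_sent_count` are hypothetical callbacks for the two register accesses.

```python
BOOT_BUFFER_BYTES = 4096   # size of the boot buffer
WORD_BYTES = 8             # assumed payload of one RSH_PG_DATA write

def stream_boot_image(words, write_pg_data, read_sent_count):
    """Send boot words without ever filling the 4KB boot buffer."""
    written = 0
    for word in words:
        # Occupancy after this write would be written + WORD_BYTES - sent.
        while written + WORD_BYTES - read_sent_count() > BOOT_BUFFER_BYTES:
            pass                   # wait for the Tile to drain the buffer
        write_pg_data(word)
        written += WORD_BYTES
    return written
```

Because the agent blocks in its own loop rather than inside a register write, it remains free to access other Rshim registers if the boot process stalls.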
15.2 I/O Discovery
The Rshim provides registers that are used for generic I/O device discovery software. These
include the RSH_FABRIC_CONN, RSH_FABRIC_DIM, and RSH_IPI_LOC registers. Tile software can
read these registers to determine how the mesh is configured and where various I/O components
are attached. Additionally, the RSH_TILE_COL_DISABLE registers indicate which Tiles within the
RSH_FABRIC_DIM limits are not available even though their mesh switch is present.
The device discovery process is described in more detail in Section 1.1.5, “Device Discovery.”
15.3 tile-monitor FIFOs
The Rshim provides two FIFOs for general purpose communication between Tile software and an
external agent, or “host”. The intended use is for one FIFO to be dedicated for host-to-Tile communication and the other FIFO to be dedicated for Tile-to-host communication. Each FIFO has
associated high-water-mark and low-water-mark interrupts. Although the FIFOs are general purpose, the USB endpoint interface does make assumptions about how they are used in the case
where its tile-monitor capability is enabled.
The tile-monitor FIFOs are accessible via the RSH_TM_HOST_TO_TILE and RSH_TM_TILE_TO_HOST registers.
15.4 Down-Counters and Watchdog
The Rshim provides three independent 48-bit down-counters that can be connected to interrupts
or used as watchdogs to initiate chip resets. The counters can be externally referenced to a real-time clock source or can be timed based on REFCLK_CORE. For more information about clock
inputs, refer to the appropriate TILE-Gx™ data sheet.
The down counters are contained in the RSH_DOWN_COUNT registers (for example the
RSH_DOWN_COUNT_CONTROL register).
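The range of a 48-bit down-counter can be worked out with simple arithmetic. The sketch assumes the counter decrements once per reference clock cycle, and the 125 MHz figure in the example is purely illustrative; take the real clock rate from the data sheet. The function names are hypothetical.

```python
COUNTER_BITS = 48

def down_count_for(seconds, ref_hz):
    """Return the start value for a timeout, assuming one tick per clock cycle."""
    ticks = round(seconds * ref_hz)
    if not 0 <= ticks < (1 << COUNTER_BITS):
        raise ValueError("timeout does not fit in a 48-bit counter")
    return ticks

def max_interval_seconds(ref_hz):
    """Longest interval a 48-bit counter can time at the given clock rate."""
    return ((1 << COUNTER_BITS) - 1) / ref_hz

print(down_count_for(2, 125_000_000))          # 250000000
print(max_interval_seconds(125_000_000) / 86400 > 26)   # True: more than 26 days
```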
15.5 Rshim JTAG
Rshim JTAG provides a mechanism to access Tilera-specific JTAG registers inside the TILE-Gx™
processor. These registers are typically used to access diagnostics features within the Tile. The
Rshim JTAG controls are located in the RSH_JTAG registers (for example the RSH_JTAG_CONTROL
register).
15.6 Reset Control
The TILE-Gx processor can be partially or completely reset using the RSH_RESET_CONTROL,
RSH_RESET_MASK, and RSH_IO_RESET registers. These registers allow individual I/O devices
to be reset without impacting the rest of the device. Alternatively, an I/O device can be left intact while
resetting the rest of the device. This feature is useful when the part needs to be reset to load new
software but a particular I/O port needs to remain “up” while the chip is being reset.
The RSH_BREADCRUMB and RSH_SCRATCH_BUF registers retain state during software reset so they
can be used to indicate status to the rebooted system (POST failures for example).
15.7 Byte Access Interface
Some external hosts that map directly into Rshim register space through PCIe can only support
32-bit accesses. In this case, the RSH_BYTE_ACC registers (for example the RSH_BYTE_ACC_CTL
register) provide an indirect-access mechanism that allows atomic access to TILE-Gx’s 64-bit register space. This mechanism is not required for UART, I2C, or USB hosts, since those interfaces
define a 64-bit access mechanism through the host’s API.
15.8 Remote Interface Access and Device Protection
Devices not directly attached to the Rshim, such as PCIe and USB, use a dedicated on-chip interconnect to access Rshim register space. This allows chip-level boot and debug without
interference with the mesh networks. The remote devices are defined in the RSH_DEVICE_PROTECTION register. This register can be used to block access by remote devices or locally-attached
Rshim devices.
CHAPTER 16 UART INTERFACES
16.1 UART Interface
16.1.1 Overview
The UART controller is used to communicate between the tiles and an external device (via the two
UART serial signals, serial_tx and serial_rx). The UART controller uses a 256-byte transmit FIFO and a 256-byte receive
FIFO. Figure 16-1 provides a graphical view of the UART interface.
[Block diagram: Tile/RSHIM Register Interface, Write FIFO, FSM, TX FIFO, RX FIFO, and UART Protocol Controller, clocked by RefClk and connected to the Remote UART through serial_tx and serial_rx.]
Figure 16-1: UART Interface Block Diagram
The UART interface can operate in two modes: interrupt mode and protocol mode.
In interrupt mode, the UART interface provides a typical transmit/receive interface between an
external device and on-chip tiles. Data written to the write FIFO by a tile is transferred to an internal transmit FIFO and then transmitted out the serial transmit output. Data received on the serial
receive input is transferred to the receive FIFO, which the tile can then read.
In protocol mode, the UART interface provides an external device with the ability to read or
write any register in any Rshim device. Bytes received via the serial receive input are interpreted
as register read or write commands. Read responses are transmitted via the serial transmit
output.
16.1.1.1 Protocol Mode
Protocol mode can be enabled either via a boot strap pin or through UART configuration registers. In order to use protocol mode, the UART must be configured to use 8-bit wide data.
A typical usage of protocol mode provides boot code to the chip by writing to the Rshim STN
Data register. Because it provides access to any register in any Rshim device, it can also be useful
for diagnostic purposes.
In protocol mode, the bytes received are interpreted by the UART interface as “segments” where
each segment can write to or read from any register in any Rshim device. The format of a segment
is described in Figure 16-2. Each segment has four bytes of header and n bytes of data.
[Segment layout diagram: segments 0 through n, each with a 4-byte header followed by its data bytes. Header fields in the order shown: bytes[7:0]; dest[4:0] and bytes[10:8]; dest[12:5]; Read, dest[15:13], and channel[3:0]. Bit markers shown: 7, 6, 4, 0.]
Figure 16-2: Protocol Mode
Table 16-1. Protocol Mode Format Descriptions
Bitfield        Description
bytes[10:0]     Specifies the transfer size in bytes (excluding the fixed 4 bytes of header).
                  1   Transfer 1 byte
                  n   Transfer n bytes
                Note:
                For a read request, the transfer size must be 1 byte or 8 bytes.
                For a write request, the transfer size must be 1 to 8 bytes (bytes = 1/2/3/4/5/6/7/8).
                A boot load is a special write request, where the transfer size must be a multiple of 8 bytes.
channel[3:0]    Channel number in the Rshim.
dest[15:0]      Destination address in the Rshim. The lower 3 bits indicate the byte offset.
read            Specifies the direction of the transfer.
                  1   Read request (for example, the external UART reads an address in the Rshim)
                  0   Write request (for example, the external UART writes an address in the Rshim)
data            Data. Note that a read request does not have a data field (read data is transmitted in the
                opposite direction).
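A segment header can be encoded and decoded as sketched below. The bit assignment is an assumption read from the field order in Figure 16-2: byte 0 holds bytes[7:0]; byte 1 holds dest[4:0] in bits 7:3 and bytes[10:8] in bits 2:0; byte 2 holds dest[12:5]; byte 3 holds read in bit 7, dest[15:13] in bits 6:4, and channel[3:0] in bits 3:0. It also assumes byte 0 is sent first. Verify both against the register documentation before relying on them; the function names are hypothetical.

```python
def encode_segment_header(nbytes, dest, channel, read):
    """Build the 4-byte segment header under the assumed bit layout."""
    if not 0 <= nbytes < (1 << 11):
        raise ValueError("bytes must fit in 11 bits")
    if not 0 <= dest < (1 << 16):
        raise ValueError("dest must fit in 16 bits")
    if not 0 <= channel < (1 << 4):
        raise ValueError("channel must fit in 4 bits")
    return bytes([
        nbytes & 0xFF,                                     # bytes[7:0]
        ((dest & 0x1F) << 3) | (nbytes >> 8),              # dest[4:0], bytes[10:8]
        (dest >> 5) & 0xFF,                                # dest[12:5]
        (int(bool(read)) << 7) | ((dest >> 13) << 4) | channel,
    ])

def decode_segment_header(header):
    """Inverse of encode_segment_header: (nbytes, dest, channel, read)."""
    b0, b1, b2, b3 = header
    nbytes = b0 | ((b1 & 0x07) << 8)
    dest = (b1 >> 3) | (b2 << 5) | (((b3 >> 4) & 0x07) << 13)
    return nbytes, dest, b3 & 0x0F, bool(b3 >> 7)

def write_segment(dest, channel, data):
    """Header plus payload for a register write of 1 to 8 bytes."""
    if not 1 <= len(data) <= 8:
        raise ValueError("a write request carries 1 to 8 bytes")
    return encode_segment_header(len(data), dest, channel, read=False) + bytes(data)

header = encode_segment_header(8, 0x1234, 5, read=True)
assert decode_segment_header(header) == (8, 0x1234, 5, True)
```

A read request consists of the header alone, since the read data travels in the opposite direction.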
16.1.2 Data Flows
16.1.2.1 Receiving Data
In interrupt mode, the tile reads the UART receive data one byte at a time. The transfer size of
each read command is always 1 byte; that is, each read of the UART_RECEIVE_DATA register returns 1
byte of receive data in bits[7:0] of the data bus on the register interface. Two usage models can be used for receiving data.
• Interrupt. The UART_RECEIVE_DATA register can be read when the UARTSH_RFIFO_WE or UARTSH_RFIFO_AFULL interrupt is generated (refer to the UARTSH interrupt status register),
which means receive data is available. Tile software can clear the interrupt bits, then
poll the UART_FIFO_COUNT register to determine how many entries are available in the receive
FIFO.
• Status Polling. Software first polls the UART_FIFO_COUNT register. When the receive FIFO is not empty,
it issues the read command(s) accordingly.
16.1.2.2 Transmitting Data
The Rshim writes the UART transmit data (UART_TRANSMIT_DATA register) one byte at a time.
The transfer size of each write command is always 1 byte, carried on bits[7:0] of the data bus on the register
interface; that is, each write to the UART_TRANSMIT_DATA register delivers 1 byte of write
data. Once the write data is written, the transmit data is sent to the external device as long as
there is no pending transmit data. One of two usage models can be used to issue transmit data.
• Interrupt. Software can issue up to two write (UART_TRANSMIT_DATA register) commands initially. The
transmit data register can be written again when the UARTSH_WFIFO_RE interrupt is
asserted, which means one byte of transmit data has been consumed (by the transmit FIFO). Software
can clear the interrupt bit, then poll the UART_FIFO_COUNT register to determine how
many write (UART_TRANSMIT_DATA register) commands to issue.
• Status Polling. Software polls the UART_FIFO_COUNT register. When the write FIFO is not full, the software
issues the write command(s) accordingly.
16.1.3 Flow Control
• Hardware CTS (clear to send)/RTS (request to send) flow control is NOT supported by the
UART interface.
• XON/XOFF style flow control can be implemented in the higher level software.
16.1.4 Master Arbitration
All registers in the UART Interface can be accessed by both tiles and an external device. However,
there are two exceptions to normal register read/write access.
• The UART_ELECTRICAL_CONTROL register is not writable by the external UART (the write
request is simply dropped). This is the only register that is writable by a tile but not by an external device.
• The UART_RECEIVE_DATA register can be read by an external device. However, the data entry
is not popped from the receive FIFO (that is, it remains in the receive FIFO).
If both the external UART and the tiles write to the same register, the write request from the
external UART has higher priority. The write request from the tiles is queued and served later.
16.1.5 8/64 Bits Handling
In the protocol mode, the UART interface provides 8/64 bits handling logic. The remote UART is
byte-oriented, whereas registers in the UART Shim and the Rshim are word-oriented.
16.1.5.1 Remote UART Writes
Because a byte mask is not available, remote UART write requests should be performed in 8-byte
quantities, starting from byte address 0 (that is, in the sequence byte 0, byte 1, byte 2, byte
3, …, byte 7). A partial write is filled with 0s; for example, if writes are performed to byte 1 and
byte 2 only, the remaining bytes are filled with 0s. An unaligned dest[2:0] is supported,
but the ending address of a write should not cross the 8-byte boundary. If it does, the write is
considered illegal and the write data is written to a wrapping address. For example, if dest[2:0]=1
and bytes=8, the first seven bytes of UART write data are written to byte addresses 1 through 7,
and the eighth byte is written to byte address 0 (the wrapping address).
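The zero-fill and wrapping behavior described above can be modeled in a few lines. This is only an illustrative sketch: the function name place_uart_write is hypothetical, and the model reflects the behavior stated in this section, not the hardware implementation.

```python
def place_uart_write(dest, data):
    """Model where each byte of a remote UART write lands in a 64-bit word.

    dest: starting byte address within the word (dest[2:0], 0..7).
    data: up to 8 bytes of UART write data.
    Returns a list of 8 byte values, indexed by byte address.
    """
    if not 0 <= dest <= 7:
        raise ValueError("dest must fit in dest[2:0]")
    if len(data) > 8:
        raise ValueError("at most 8 bytes per write")
    word = [0] * 8                      # bytes never written stay 0
    for i, value in enumerate(data):
        word[(dest + i) % 8] = value    # crossing byte 7 wraps to byte 0
    return word
```

With dest=1 and eight bytes of data, the eighth byte lands at byte address 0, which is the illegal wrapping case; software should avoid it by starting 8-byte writes at byte address 0.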
16.1.5.2 Remote UART Reads
In the protocol mode, the remote UART must perform each read request as either one byte or eight bytes at a time. An unaligned dest[2:0] is supported. Some read requests have side effects; for
example, a read request may trigger a FIFO “pop” operation. As such, the bytes field must be
configured accordingly.
16.1.6 Error Handling and Interrupts
When error conditions occur, they are logged in the UART_INTERRUPT_STATUS register. When
this occurs, an interrupt can be sent to a tile depending on the mask setting (in the UART_INTERRUPT_MASK register) and the credit state of the binding being used. The UART interface uses the
four shared Rshim bindings to determine the destination tile for interrupts and each interrupt can
be programmed to use any of those bindings. See the UART interrupt register (UART_MODE register) definitions for more information about the specific errors that are handled.
16.1.7 UART Controller Registers
For detailed descriptions of the UART registers, refer to uart.html.
CHAPTER 17
I2C MASTER INTERFACE
17.1 Overview
The I2C Master Interface provides an interface for tiles to write and read external I2C devices. It
also includes a hardware state machine that can read from an external EEPROM at boot and then
write to any Rshim device register. External I2C devices can be classified into the following three
groups:
•
Generic I2C port (for example: on a controller)
•
Bootable I2C-based serial EEPROM (for example: boot ROM)
•
Non-bootable I2C based serial EEPROM (for example: the Serial Presence Detect (SPD)
EEPROM on DDR2 DIMM)
The I2C Master Interface supports serial EEPROMs up to 1M bits. If the serial EEPROM is used as a
boot ROM, its size must be in the range of 32K bits to 1M bits; that is, the EEPROM must have a
16-bit word address in addition to the 7-bit device ID. The I2C Master Interface does not support
clock stretching.
[Block diagram labels: rshim (125MHz); Register Interface; Host Interface; Write FIFO (wfifo); Read FIFO (rfifo); Buffer FIFO (bsfifo); FSM; I2C Interface (100 to 400KHz); External I2C Devices (for example: Serial EEPROM).]
Figure 17-1: I2C Master Interface Block Diagram
17.1.1 I2C Master Boot Options
The following boot options are available:
•
Boot request type. Both hard boot requests and soft boot requests (via a special boot instruction)
are supported. The hard boot request is typically asserted after power-on reset, once the system is ready (for example, when a clock is stable).
•
Boot ROM type. A bootstrap pin determines whether the chip boots from SPI-based flash or I2C-based
EEPROM. The software boot instruction determines the boot type afterwards.
•
Boot code destination. A boot code segment can be sent to different destinations. For example,
the first segment can be sent to a UART device in the Rshim, the second segment can be sent to
an I2C device in the Rshim, and the rest of segments can be sent to tiles.
17.1.2 Boot ROM Format
Figure 17-2 presents the boot ROM format, which comprises three regions:
•
An eight-byte ROM header.
•
One or more program segments. Each program segment consists of an eight-byte segment
header followed by program code (in eight-byte quantities).
•
A user data region.
[Layout diagram labels: header word (rsvd0, rev_id); segment 0 through segment n, each with a segment header word (e, rsvd1, dest, rsvd1, word) followed by its Boot Code; Data.]
Figure 17-2: I2C Master Boot ROM Format
Table 17-1. I2C Master Boot ROM Format
Field          Description
rev_id[7:0]    Revision of the BOOT ROM. This ID will be stored in a Boot Revision ID
               (I2CM_BOOT_REVISION_ID) register after the boot.
rsvd0[55:0]    Reserved.
rsvd1[14:0]    Reserved.
word[16:0]     Specifies the number of (8B) words in the segment, including the segment
               header (e, dest, word).
               1: 1 word = 64 bits
               n: n words = 64 n bits
dest[16:0]     Specifies the destination of the segment.
               [16:13]: channel number
               [12:0]: word address number
e              1: Specifies that this segment is the end of the boot code.
Boot Code      Only the boot code portion will be forwarded to the specified destination. The
               format of boot code is transparent to the I2C Master Interface. I2C Master
               Interface does not interpret the boot code.
Data           User data section of the ROM will not be processed by the controller.
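As a minimal sketch of how the word and dest fields in Table 17-1 are interpreted, the helpers below split dest[16:0] into its channel number and word address, and derive the boot code length from the word count. The function names are hypothetical. The position of each field within the 64-bit segment header is not modeled here, because this section gives only field widths.

```python
def decode_dest(dest):
    """Split the 17-bit dest field into (channel number, word address)."""
    if not 0 <= dest < (1 << 17):
        raise ValueError("dest is a 17-bit field")
    channel = (dest >> 13) & 0xF      # dest[16:13]
    word_address = dest & 0x1FFF      # dest[12:0]
    return channel, word_address


def boot_code_words(word):
    """Boot code length in 8-byte words.

    The word field counts the segment header too, so one word is
    subtracted to get the length of the boot code alone.
    """
    if not 1 <= word < (1 << 17):
        raise ValueError("word must count at least the segment header")
    return word - 1
```

For example, a segment with word=4 occupies 32 bytes in the ROM: 8 bytes of segment header plus 24 bytes of boot code.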
17.1.3 Boot Operations
A boot sequence starts when the I2C Master Interface receives a hard or soft boot request.
The I2C Master Interface always boots from address 0 in the boot ROM. It first reads the ROM
header, then reads boot code segments until the last segment is finished. Only the boot code
portion of each segment is forwarded to the destination; the ROM header and the segment
header(s) are not forwarded and are used only by the I2C Master Interface.
A bootstrapping pin, CONFIG_I2C[0], selects the I2C device address from which to boot. This
feature can be used when the chip boots from an I2C boot ROM whose address conflicts
with the SPD ROM on the DDR3 DIMMs.
•
CONFIG_I2C[0] = 1: boot from the ROM located at device address 1010_100.
•
CONFIG_I2C[0] = 0: boot from the ROM located at device address 1010_000.
Note: For more information about the CONFIG_I2C[0] pin, refer to the appropriate data sheet for
your processor.
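The two boot device addresses can be written out numerically. The sketch below maps CONFIG_I2C[0] to the 7-bit device address given above and forms the first byte placed on the bus, using the standard I2C convention that the 7-bit address is followed by the R/Wn bit (1 = read). The names are hypothetical.

```python
# 7-bit boot ROM device address selected by the CONFIG_I2C[0] strap.
BOOT_ROM_ADDRESS = {0: 0b1010000, 1: 0b1010100}


def boot_rom_address(config_i2c0):
    """Return the 7-bit I2C device address used for boot."""
    return BOOT_ROM_ADDRESS[config_i2c0]


def address_byte(addr7, read):
    """First byte on the bus: 7-bit address in bits [7:1], R/Wn in bit 0."""
    return (addr7 << 1) | (1 if read else 0)
```

So CONFIG_I2C[0] = 0 selects 0x50 and CONFIG_I2C[0] = 1 selects 0x54.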
17.2 Usage Model
17.2.1 Generic Operation
Generic read and write operations are supported on the I2C interface.
The following figures show the read and write operations on the I2C interface.
[Bus sequence: START, SLAVE ADDRESS + R/Wn (WRITE, dummy write), ACK, WORD ADDRESS [7:0], ACK, REPEATED START, SLAVE ADDRESS + R/Wn (READ), ACK, DATA1, ACK, DATA2, ACK, …, DATAn, NACK, STOP (see note 1). Each byte is shown MSB to LSB.]
Figure 17-3: 8-Bit Address Read (see note 1)
[Bus sequence: START, SLAVE ADDRESS + R/Wn (WRITE, dummy write), ACK, WORD ADDRESS [15:8], ACK, WORD ADDRESS [7:0], ACK, REPEATED START, SLAVE ADDRESS + R/Wn (READ), ACK, DATA1, ACK, DATA2, ACK, …, DATAn, NACK, STOP.]
Figure 17-4: 16-Bit Address Read
[Bus sequence: REPEATED START, SLAVE ADDRESS + R/Wn (READ), ACK, DATA1, ACK, DATA2, ACK, …, DATAn, NACK, STOP.]
Figure 17-5: Read without Address
[Bus sequence: START, SLAVE ADDRESS + R/Wn (WRITE), ACK, WORD ADDRESS [7:0], ACK, DATA1, ACK, …, DATAn, NOACK, STOP (see note 1).]
Figure 17-6: 8-Bit Address Write (see note 1)
[Bus sequence: START, SLAVE ADDRESS + R/Wn (WRITE), ACK, WORD ADDRESS [15:8], ACK, WORD ADDRESS [7:0], ACK, DATA1, ACK, …, DATAn, NOACK, STOP.]
Figure 17-7: 16-Bit Address Write
[Bus sequence: START, SLAVE ADDRESS + R/Wn (WRITE), ACK, DATA1, ACK, …, DATAn, NOACK, STOP.]
Figure 17-8: Write without Address
1. Refer to I2CM_NO_STOP for more details.
17.2.2 Software Instructions
The I2C master interface supports the following instructions:
Table 17-2. Supported Instructions
Instruction    Description    Instruction format
BOOT           Boot           10
READ           Read           00
WRITE          Write          01
Software Commands to Execute I2C EEPROM Instructions
There are three steps to executing a read or write operation listed in Table 17-2:
1.
Program the I2C Address (I2CM_ADDRESS) register and program the I2C Byte (I2CM_BYTE)
register.
2.
Program the I2C Instruction (I2CM_INSTRUCTION) register to start the operation.
3.
Start data handling by either writing to the I2C Write Data (I2CM_WRITE_DATA) register or by
reading from the I2C Read Data (I2CM_READ_DATA) register.
To program the I2C Instruction (I2CM_INSTRUCTION) register, software needs to make sure that
the I2C interface is idle. There are two usage models:
•
Interrupt
When the I2C_INST_EXEC interrupt (from the I2CM_INT_VEC_W1TC register) is asserted, it
means the previous instruction is done. Software first clears the I2C_INST_EXEC interrupt
bit, and then writes to I2C Instruction (I2CM_INSTRUCTION) register.
•
Status Polling
Software can poll the I2C_BUSY bit of the I2CM_FLAG register, and write to the I2C Instruction (I2CM_INSTRUCTION) register when the I2C_BUSY=0.
Write data is programmed by writing to the I2C Write Data (I2CM_WRITE_DATA) register.
Each write pushes one entry to the write FIFO. Before writing to the I2C Write Data (I2CM_WRITE_DATA) register, software must make sure that the write FIFO is not full. There are two usage models:
•
Interrupt
When the I2C_WFIFO_READ interrupt (of the I2CM_INT_VEC_W1TC register) is asserted, it
means previous write data has been processed. Software first clears the I2C_WFIFO_READ
interrupt bit. Software then writes to the I2C Write Data (I2CM_WRITE_DATA) register based
on the status of the write FIFO (I2CM_FLAG register). For example, the write FIFO is full if
I2C_WFIFO_FULL=1; the write FIFO has one data entry if I2C_WFIFO_FULL=0 and I2C_WFIFO_EMPTY=0; the write FIFO is empty if I2C_WFIFO_EMPTY=1.
•
Status Polling
Software can poll the I2CM_FLAG register, and write to the I2C Write Data (I2CM_WRITE_DATA) register when the write FIFO is not full.
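The flag decoding and the status-polling model for the write FIFO can be sketched as follows. This is an illustrative model only: the register accessors are passed in as callables because the bit positions of I2C_WFIFO_FULL and I2C_WFIFO_EMPTY within I2CM_FLAG are not given in this section, and the function names and the max_polls limit are hypothetical.

```python
def wfifo_state(full, empty):
    """Interpret the I2C_WFIFO_FULL / I2C_WFIFO_EMPTY flags."""
    if full and empty:
        raise ValueError("FULL and EMPTY cannot both be set")
    if full:
        return "full"
    if empty:
        return "empty"
    return "one entry"


def write_when_not_full(read_flags, write_data, value, max_polls=1000):
    """Poll the flags until the write FIFO is not full, then push one entry.

    read_flags: callable returning (full, empty) from I2CM_FLAG.
    write_data: callable performing a write to I2CM_WRITE_DATA.
    Returns True if the entry was written, False if polling gave up.
    """
    for _ in range(max_polls):
        full, empty = read_flags()
        if wfifo_state(full, empty) != "full":
            write_data(value)
            return True
    return False
```

Bounding the number of polls is a design choice for the sketch; real software might instead fall back to the interrupt model when the FIFO stays full.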
Read data will be returned by reading the I2C Read Data (I2CM_READ_DATA) register. Each read
returns (pops) one entry from the read FIFO. To read from the I2C Read Data register, software
makes sure that the read FIFO is not empty.
There are two usage models:
•
Interrupt
When the I2C_RFIFO_WRITE interrupt is asserted, it means new read data is available. Software first clears the I2C_RFIFO_WRITE interrupt bit. Software then reads the I2C Read Data
(I2CM_READ_DATA) register based on the status of the read FIFO (I2CM_FLAG register).
•
Status Polling
Software can poll the I2CM_FLAG register, and read from the I2C Read Data (I2CM_READ_DATA) register when the read FIFO is not empty.
The table below presents examples of how an I2C instruction is implemented by a sequence of
software reads and writes. For simplicity, the detailed steps of how to write the I2C Instruction
register are abbreviated as Write(I2CM_INSTRUCTION). The detailed steps of how to write the I2C Write
Data (I2CM_WRITE_DATA) register are abbreviated as Write(I2CM_WRITE_DATA), where WDAT is the write data field. The detailed steps of how
to read the I2C Read Data (I2CM_READ_DATA) register are abbreviated as Read(I2CM_READ_DATA), where RDAT is the read data field.
Table 17-1. Examples of How an I2C Instruction is Implemented (for Reads and Writes)
Instruction: WRITE (8/16-bit address) (a)
Software Command Sequence:
• Write(I2CM_ADDRESS) = starting byte address to program (this step can be skipped if ADDR_SEL.ADDR_DIS=1)
• Write(I2CM_BYTE) = desired transfer size
• Write(I2CM_INSTRUCTION) = 'b01
• Issue multiple Write(I2CM_WRITE_DATA) according to the BYTE. One Write(I2CM_WRITE_DATA) contains eight bytes of the desired write value. Note: WDAT[7:0] contains data for byte address 0; WDAT[63:56] contains data for byte address 7.
Instruction: READ (8/16-bit address) (a)
Software Command Sequence:
• Write(I2CM_ADDRESS) = starting byte address to read (this step can be skipped if ADDR_SEL.ADDR_DIS=1)
• Write(I2CM_BYTE) = desired transfer size
• Write(I2CM_INSTRUCTION) = 'b00
• Issue multiple Read(I2CM_READ_DATA) according to the BYTE. One Read(I2CM_READ_DATA) returns eight bytes of read data. Note: I2CM_READ_DATA[7:0] contains data for byte address 0; I2CM_READ_DATA[63:56] contains data for byte address 7.
Instruction: Write (32-bit address)
Software Command Sequence:
Disable the address phase (ADDR_SEL.ADDR_DIS=1).
• Write(I2CM_BYTE) = 4 (that is, 4 bytes of address) + desired transfer size
• Write(I2CM_INSTRUCTION) = 'b01
• Issue multiple Write(I2CM_WRITE_DATA) according to the BYTE. One Write(I2CM_WRITE_DATA) contains eight bytes of the desired write value. The first Write(I2CM_WRITE_DATA) contains the four bytes of address. Note: WDAT[7:0] will be sent out first.
Instruction: Read (32-bit address)
Software Command Sequence:
Disable the address phase (ADDR_SEL.ADDR_DIS=1).
• Write(I2CM_BYTE) = 4 (that is, 4 bytes of address)
• Write(I2CM_INSTRUCTION) = 'b01 (this initiates the dummy write operation)
• Issue one Write(I2CM_WRITE_DATA), which contains the four bytes of address for the dummy write operation. WDAT[7:0] will be sent out first.
• Write(I2CM_BYTE) = desired transfer size
• Write(I2CM_INSTRUCTION) = 'b00
• Issue multiple Read(I2CM_READ_DATA) according to the BYTE. One Read(I2CM_READ_DATA) returns eight bytes of read data.
a. Refer to the I2CM_ADDR_SEL register for more details on 8/16-bit address and address disable.
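The WRITE (8/16-bit address) sequence and its data packing can be sketched as below. This is a hedged illustration: the function names are hypothetical, the register names are used only as labels in the returned list, and the idle check before Write(I2CM_INSTRUCTION) and the FIFO-not-full check before each Write(I2CM_WRITE_DATA) are omitted for brevity.

```python
def pack_wdat(data):
    """Pack bytes into 64-bit write data words.

    Byte address 0 goes to WDAT[7:0] and byte address 7 to WDAT[63:56],
    which is a little-endian packing. A short final chunk is zero-extended.
    """
    words = []
    for offset in range(0, len(data), 8):
        chunk = data[offset:offset + 8]
        words.append(int.from_bytes(chunk, "little"))
    return words


def write_sequence(address, data):
    """Return the ordered (register, value) writes for a WRITE instruction."""
    ops = [
        ("I2CM_ADDRESS", address),      # starting byte address
        ("I2CM_BYTE", len(data)),       # transfer size in bytes
        ("I2CM_INSTRUCTION", 0b01),     # WRITE
    ]
    ops += [("I2CM_WRITE_DATA", word) for word in pack_wdat(data)]
    return ops
```

A 12-byte transfer therefore needs two Write(I2CM_WRITE_DATA) operations, the second carrying only four valid bytes in its low half.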
17.2.3 I2C EEPROM Page Mode
An I2C EEPROM has a page size, and writing within a page improves write (program) timing. If a
transfer crosses a page boundary, the EEPROM typically wraps around the address, and data
contents can be overwritten in an implementation-dependent way. The I2C Master Interface
partitions the transfer into pages if the transfer size (I2CM_BYTE register) crosses the page size
(I2CM_PAGE_SIZE register).
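The effect of partitioning a transfer at page boundaries can be illustrated with a short model. This is a sketch of the concept only; the function name is hypothetical and the hardware's actual partitioning logic is not described in this section.

```python
def split_by_page(start, length, page_size):
    """Split a transfer into (address, byte_count) chunks.

    No chunk crosses a page boundary, so the EEPROM never wraps
    around within a page during a write.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    chunks = []
    address = start
    remaining = length
    while remaining > 0:
        room = page_size - (address % page_size)   # bytes left in this page
        count = min(room, remaining)
        chunks.append((address, count))
        address += count
        remaining -= count
    return chunks
```

For example, a 16-byte transfer starting at address 60 with a 64-byte page is split into 4 bytes at address 60 and 12 bytes at address 64.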
17.2.4 Error Handling and Interrupts
It is up to software to maintain the correct boot ROM format, that is, the correct encoding to specify
the length of the boot code. It is up to software to maintain the correct protocol when executing
I2C-based serial EEPROM instructions; that is, I2CM_ADDRESS and I2CM_BYTE must be programmed before I2CM_INSTRUCTION, and EEPROM reads and programs must be in 4B quantities. It is also up to
software to issue recognizable instructions.
Error conditions are logged. For more information on how errors are logged, refer to the I2CMS interrupt registers. An interrupt is sent to a binding tile once an error condition is encountered; the
binding is defined by the Rshim interrupt binding register (RSH_INT_BIND).
17.3 Registers
For detailed descriptions of the I2C Master registers, refer to i2cm.html.
CHAPTER 18
I2C SLAVE INTERFACE
18.1 Overview
The I2C Slave Interface, illustrated in Figure 18-1, is the interface to an external I2C device, where
the external I2C device is the initiator (master) and the I2C Slave Interface is the target (slave). The I2C
interface supports Standard-mode, Fast-mode, and Fast-mode Plus. The I2C slave controller supports clock stretching.
For more information, refer to the I2C specification.
[Block diagram labels: TILEs/RSHIM (125MHz); Transmit FIFO; Receive FIFO; Buffer FIFO; FSM; I2C Interface (100/400KHz); External I2C Devices (for example: Bus Controller).]
Figure 18-1: I2C Slave Interface Block Diagram
18.2 Usage Model
18.2.1 Data Flows
An external I2C master device can read and write the address space in the I2C Slave Interface.
An external I2C master device can read and write the address space in a Rshim device (other than
the I2C Slave Interface) via the “host interface”. For example, boot code can be pushed to the
RSH_PG_DATA register in the Rshim. The I2C Slave controller uses the buffer FIFO to flow control
the access to a Rshim device. The buffer FIFO is not software-addressable.
A Rshim device can read and write the address space in the I2C Slave Interface via the “register
interface”, where Rshim is the initiator (master). Note that the Rshim cannot initiate requests to
external I2C devices; the Rshim can only respond to external I2C devices.
The I2C Slave interface implements two software addressable FIFOs to assist data passing.
•
I2C slave interface “pull” model: An external I2C master device can pass data to a tile via the
receive FIFO, that is data is written to the receive FIFO by an external device and data is read
from the receive FIFO by a tile. The receive FIFO is software addressable. A receive FIFO write
event interrupt is provided, along with other receive FIFO status flags.
•
I2C slave interface “push” model: A tile can pass data to an external I2C master device via the
transmit FIFO, that is data is pushed to the transmit FIFO by a tile and data is popped from the
transmit FIFO by an external I2C master device. The transmit FIFO is software addressable. A
transmit FIFO read event interrupt is provided, along with other transmit FIFO status flags.
18.2.2 Direct-Addressing
The I2C Slave Interface has a configurable 7-bit I2C slave address, held in the I2C slave address (I2CS_SLAVE_ADDRESS)
register. The I2C Slave Interface responds on the bus only if the address targeted by the external master matches this I2C slave address.
If the 7-bit slave address matches, the I2C Slave controller applies direct addressing to the 16-bit
address. Bit assignments of the 16-bit address are shown in Table 18-1.
Table 18-1. I2C Slave Bit Assignments
Bits     Description
15:12    Rshim Channel number[3:0]
11:3     (word) Register number[8:0]
2:0      Byte address
If the I2C Slave controller is selected (that is, the channel number matches that of the I2C Slave controller), the total addressable space is 2K bytes.
I2C slave controller addressing = {register number [8:0], byte address [2:0]}
If a Rshim device (other than the I2C slave controller) is selected, the total address space becomes
1M bytes. The register number can be extended by 4 bits via the I2CS_RDEV_ADDR register.
Rshim device addressing =
{channel number [3:0], I2CS_RDEV_ADDR[3:0], register number [8:0]}
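The bit assignments in Table 18-1 and the Rshim device addressing formula can be expressed directly as shifts and masks. The function names below are hypothetical, and the sketch only decodes the address fields; it does not model the slave controller itself.

```python
def decode_slave_address(addr16):
    """Split the 16-bit direct address into (channel, register, byte)."""
    channel = (addr16 >> 12) & 0xF       # bits 15:12
    register = (addr16 >> 3) & 0x1FF     # bits 11:3, word register number
    byte = addr16 & 0x7                  # bits 2:0
    return channel, register, byte


def rshim_device_address(addr16, rdev_addr):
    """Form {channel[3:0], I2CS_RDEV_ADDR[3:0], register[8:0]}.

    The result is a 17-bit word address; the byte address is not part of it.
    """
    channel, register, _ = decode_slave_address(addr16)
    return (channel << 13) | ((rdev_addr & 0xF) << 9) | register
```

The 17-bit word address covers 2^17 eight-byte words, which is consistent with the 1M byte address space stated above.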
18.2.3 No-Address Access
The I2C Slave Interface uses direct addressing by default. The address phases can be
disabled via the I2CS_ADDR_PHASE register, in which case the I2C Slave Interface is considered an
address-less device.
In the “no-address” mode, an external I2C master device pushes all data to the receive FIFO (all
payload is treated as data, with no address), and an RFIFO_WRITE interrupt is raised
when an entry is written to the receive FIFO. It is up to the tile-side software to read from the
receive FIFO in a timely fashion.
Likewise, an external I2C master device reads all data from the transmit FIFO, and a TFIFO_READ
interrupt is raised when an entry is read from the transmit FIFO. It is up to the tile-side
software to write to the transmit FIFO in a timely fashion.
18.2.4 8 Bits / 64 Bits Handling
The I2C bus is byte-oriented, whereas registers in the I2C Slave Interface and the Rshim are 8-byte
word-oriented. Because a byte mask is not available, I2C write operations should be performed in 8-byte quantities, starting from byte address 0 (that is, in the sequence byte 0, byte 1, byte 2,
… byte 7). In case of a partial write (where the transfer size is not a multiple of 8 bytes), 0s are
filled in (for example, if writes are performed to byte 0 and byte 1 only, the other bytes are
filled with 0s). The I2C Slave Interface uses the byte address to assemble multiple 8-bit I2C
write data items into one 64-bit internal write data word.
There is no restriction on I2C read operations. The I2C Slave Interface uses the byte address to steer
the 64 bits of internal read data into the 8-bit I2C read data.
18.2.5 Acknowledge Control
The I2C Slave Interface supports clock stretching. When the I2C Slave interface is not able to
respond with read data in time, it can stretch the clock. Some external I2C master devices might
not support clock stretching.
Normally, the ack/nack indicates whether the I2C Slave interface acknowledges the data or not.
The I2C Slave Interface provides a register to control how the read acknowledgment and the write
acknowledgment are handled. Refer to the I2CS_ACK_CTL register for more details.
18.2.6 Access Arbitration
Local registers in the I2C Slave Interface can be read by the Rshim and the external I2C master
device at the same time. If the Rshim and the external I2C master device try to write to the I2C
Slave Interface at the same time, higher priority is given to the external I2C master device. (Note
that a low-speed I2C master device cannot sustain back-to-back requests, so the Rshim is able to
write to the I2C Slave Interface shortly afterwards.)
18.2.7 Error Handling and Interrupts
Error conditions are logged; refer to the interrupt status and mask (I2CS_INT_VEC_MASK) registers for how errors are logged. An interrupt is sent to a binding tile once an error condition is
encountered. The interrupt binding is handled by the RSH_INT_BIND register in the Rshim.
CHAPTER 19
SPI INTERFACE
19.1 Overview
The SPI SROM interface provides an interface for tiles to write and read an off-chip SPI SROM. It
also includes a hardware state machine that can read from the SPI SROM at boot time and then
write any Rshim device register. External SPI SROM sizes, between 512K and 128M bits, are
supported.
[Block diagram labels: RSHIM (125 MHz); Register Interface; Host Interface; Write FIFO; Read FIFO; Buffer FIFO; FSM; SPI Interface (15.625 MHz); External SPI ROM.]
Figure 19-1: SPI SROM Interface Block Diagram
19.1.1 Boot Options
The following boot options are available.
•
Boot request type. Both hard boot requests and soft boot requests (via special boot instruction) are supported. The
hard boot request is via a bootstrap pin.
•
Boot code destination. A boot code segment can be sent to different destinations; for example, the first segment is sent
to a UART controller in the Rshim, the second segment is sent to an I2C device in the Rshim,
and the rest of the segments are sent to tiles.
19.1.2 Boot ROM Format
Figure 19-2 presents the boot ROM record format. The boot ROM is comprised of three regions:
•
An eight-byte ROM header.
•
One or multiple program segments. In each program segment, there is an eight-byte segment
header followed by program code (in eight-byte quantities).
•
A user data region.
[Layout diagram labels: header word (rsvd, rev_id); segment 0 through segment n, each with a segment header word (e, rsvd, dest, word) followed by its Boot Code; Data.]
Figure 19-2: SPI Boot ROM Format
Table 19-1. SPI SROM Format Description
Field          Description
Rev_id[7:0]    Revision of the BOOT ROM, will be stored in a configuration register after the boot.
Rsvd           Reserved.
Word[16:0]     Specifies the number of words (8B) in the segment, including the segment
               header (e, dest, word).
               0: 2^17 words = 2^23 bits
               n: n words = 64 n bits
Dest[16:0]     Specifies the destination of the segment.
               [16:13]: channel number
               [12:0]: word address number
E              1: Specifies this segment is the end of the boot code.
Boot Code      Only the boot code portion will be forwarded to the specified destination. The
               format of boot code is transparent to the SROM controller. SROM controller does
               not interpret the boot code.
Data           The user data section of the ROM will not be processed by the controller.
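The Word[16:0] encoding in Table 19-1 treats 0 as the maximum segment length. A minimal sketch of that decoding follows; the function names are hypothetical.

```python
def segment_words(word_field):
    """Decode Word[16:0]: 0 means 2^17 words, any other value n means n words."""
    if not 0 <= word_field < (1 << 17):
        raise ValueError("Word is a 17-bit field")
    return (1 << 17) if word_field == 0 else word_field


def segment_bits(word_field):
    """Segment length in bits; each word is 64 bits."""
    return segment_words(word_field) * 64
```

With this encoding, the largest segment is 2^17 words, or 2^23 bits, including its segment header.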
19.2 Usage Model
19.2.1 Boot Operation
A boot sequence starts when a hard or soft boot request is received by the SPI SROM controller.
The SPI SROM controller always boots from address 0 in the boot ROM. The SPI SROM controller
reads the boot header. The SPI SROM controller then reads program segments and processes
them until a segment marked as the end segment is reached. The ROM header and the segment
header(s) are not forwarded and are only used by the SPI SROM controller.
19.2.2 SPI Flash Operations
19.2.2.1 SPI Flash Instructions
In addition to the boot, the SROM controller supports the following instructions for the SPI based
serial flash, as listed in Table 19-2.
Table 19-2. SPI Flash Instructions
Instruction  Description                                                 Instruction Format   Address Bytes  Dummy Bytes  Data Bytes (write)  Data Bytes (read)
BOOT         Boot                                                        1 0000 0000 (100h)   0              0            0                   0
WREN         Write enable                                                0 0000 0110 (06h)    0              0            0                   0
WRDI         Write disable                                               0 0000 0100 (04h)    0              0            0                   0
RDID0        Read identification                                         0 1001 1111 (9fh)    0              0            0                   1 to 3
RDID1        Read identification                                         0 0001 0101 (15h)    0              0            0                   1 to 3
RDSR         Read Status Register                                        0 0000 0101 (05h)    0              0            0                   1
WRSR         Write Status Register                                       0 0000 0001 (01h)    0              0            1                   0
READ         Read Data Bytes                                             0 0000 0011 (03h)    3              0            0                   1 to max
PP           Page Program. 1 to 512 bytes can be programmed.             0 0000 0010 (02h)    3              0            1 to 512            0
SE0          Sector Erase. Erase one sector in memory array.             0 1101 1000 (d8h)    3              0            0                   0
SE1          Sector Erase. Erase one sector in memory array.             0 0101 0010 (52h)    3              0            0                   0
BE0          Bulk Erase. Erase all sectors in memory array.              0 1100 0111 (c7h)    0              0            0                   0
BE1          Bulk Erase. Erase all sectors in memory array.              0 0110 0010 (62h)    0              0            0                   0
DP           Deep power down.                                            0 1011 1001 (b9h)    0              0            0                   0
RES          Release from deep power-down and Read Electronic Signature  0 1010 1011 (abh)    0              3            0                   1 to max
RES          Release from deep power-down.                                                    0              0            0                   0
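Table 19-2 can be captured as a small lookup structure, which makes the per-instruction byte counts easy to check. The sketch below is illustrative: the names are hypothetical, the BOOT entry and the second RES row are left out, and it assumes the instruction code occupies one byte on the SPI bus (bits 7 to 0 of the instruction), which is an assumption and not stated in the table itself.

```python
# name: (opcode, address bytes, dummy bytes, max write bytes, max read bytes)
# None stands for "1 to max", where the table gives no fixed upper bound.
SPI_FLASH_INSTRUCTIONS = {
    "WREN":  (0x06, 0, 0, 0, 0),
    "WRDI":  (0x04, 0, 0, 0, 0),
    "RDID0": (0x9F, 0, 0, 0, 3),
    "RDID1": (0x15, 0, 0, 0, 3),
    "RDSR":  (0x05, 0, 0, 0, 1),
    "WRSR":  (0x01, 0, 0, 1, 0),
    "READ":  (0x03, 3, 0, 0, None),
    "PP":    (0x02, 3, 0, 512, 0),
    "SE0":   (0xD8, 3, 0, 0, 0),
    "SE1":   (0x52, 3, 0, 0, 0),
    "BE0":   (0xC7, 0, 0, 0, 0),
    "BE1":   (0x62, 0, 0, 0, 0),
    "DP":    (0xB9, 0, 0, 0, 0),
    "RES":   (0xAB, 0, 3, 0, None),
}


def bus_bytes(name, write_bytes=0, read_bytes=0):
    """Total bytes for one instruction: opcode + address + dummy + data."""
    _opcode, n_addr, n_dummy, max_wr, max_rd = SPI_FLASH_INSTRUCTIONS[name]
    for count, limit, kind in ((write_bytes, max_wr, "write"),
                               (read_bytes, max_rd, "read")):
        if limit is not None and count > limit:
            raise ValueError(f"{name}: too many {kind} data bytes")
    return 1 + n_addr + n_dummy + write_bytes + read_bytes
```

For instance, a READ of 16 bytes is 1 opcode byte, 3 address bytes, and 16 data bytes, 20 bytes in total.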
19.2.2.2 SPI Configurable Instruction Sets
The SPI SROM controller supports a configurable instruction set. Refer to the configurable
Instruction Code registers for more details. Each instruction code defines the sequence of operations
associated with the instruction, for example, the number of address bytes, number of dummy bytes,
number of write data bytes, and number of read data bytes.
19.2.2.3 SPI Flash Unknown Instruction
For any instruction not defined in the configurable instruction code registers, the SPI SROM controller simply sends the unknown instruction (bit 7 to bit 0) to the external device. Therefore, it is
up to the software to carefully manage the instructions. An interrupt is generated when any
unknown instructions are encountered.
19.2.2.4 SPI Flash Deep Power-Down
The external SPI SROM can be put into deep power-down mode (by executing the DP instruction).
Once it is in deep power-down, all instructions, except the RES instruction, are ignored. There is
no status bit to determine whether or not the current instruction is ignored. It is up to the software
to detect whether or not SROM is in deep power-down state.
19.2.2.5 SPI Flash Write In-Progress
When the external SROM is not in deep power-down mode, all attempts (except RDSR) to access
the memory array during a WRSR cycle, PP cycle, or Erase cycle (SE or BE) are ignored, and the
internal WRSR cycle, PP cycle, or Erase cycle continues unaffected.
A status register tracks the activities of the external SROM. This register can be read at any time,
even while a program, erase, or write status register cycle is in progress. It is strongly recommended to check the Write In Progress (WIP) bit before sending a new instruction to the SROM
device.
Bits  Name      Description
7     SRWD      Status register write disable bit.
6     Reserved  Read as 0
5     Reserved  Read as 0
4     BP2       Block Protect Bits (BP2, BP1, BP0). These bits define the size of the area to be
3     BP1       software-protected against Program and Erase instructions.
2     BP0
1     WEL       Write Enable Latch. When this bit is set to 0, the Write Status Register, Program
                or Erase instructions are ignored.
0     WIP       Write In-Progress Bit. This bit indicates if the memory is busy with a Write Status
                Register, Program, or Erase cycle. When read as 1, one of these cycles is in progress.
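The status register bits above can be decoded with simple masks. This is a minimal sketch; the function name and the dictionary layout are hypothetical, and the value is assumed to be the single byte returned by an RDSR instruction.

```python
def decode_status(value):
    """Decode the SPI flash status register byte into named fields."""
    return {
        "SRWD": bool(value & 0x80),     # bit 7: status register write disable
        "BP": (value >> 2) & 0x7,       # bits 4:2: BP2, BP1, BP0
        "WEL": bool(value & 0x02),      # bit 1: write enable latch
        "WIP": bool(value & 0x01),      # bit 0: write in progress
    }
```

Software would typically check the WIP field before sending a new instruction, and the WEL field before a program or erase instruction.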
19.2.2.6 SPI Flash Write Protection
SPI Flash has several protection schemes: the write protection pin (an off-chip function, for
example a jumper on the board), status register protection (SRWD), the block protect bits (BP), and the
write enable latch (WEL). Refer to the memory vendor datasheet for more details. Software manages the
write protection (for example, WEL must be enabled before a program or erase instruction can be
executed, and the target area must not be protected by the BP bits).
19.2.2.7 SPI Flash Page Mode
For optimized timings, it is recommended to use the PP instruction to program all consecutive
targeted bytes in a single sequence (up to the page size) versus using several PP sequences with
each containing only a few bytes. Note that the transfer size should not cross the page size boundary.
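One way to follow this recommendation is to plan the PP instructions so that each one programs as many consecutive bytes as fit in the current page. The sketch below only builds a list of steps; it does not access any register. The function name and step tuples are hypothetical, the page size is device-specific and must come from the memory vendor datasheet, and issuing WREN before every PP is an assumption based on typical serial flash behavior.

```python
def page_program_plan(start, data, page_size):
    """Plan PP instructions so that no transfer crosses a page boundary.

    Returns a list of steps: a WREN, a PP with its address and data,
    and a wait for WIP=0 (polled with RDSR) for each chunk.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    plan = []
    offset = 0
    while offset < len(data):
        address = start + offset
        room = page_size - (address % page_size)   # bytes left in this page
        chunk = data[offset:offset + room]
        plan.append(("WREN",))                     # WEL must be set first
        plan.append(("PP", address, chunk))
        plan.append(("RDSR until WIP=0",))         # wait for the program cycle
        offset += len(chunk)
    return plan
```

Ten bytes starting at address 250 with a 256-byte page become two PP instructions: 6 bytes at address 250 and 4 bytes at address 256.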
19.2.2.8 SPI Flash Interface
The SPI interface provides four external pins. Only SPI Mode 0 clocking is supported.
•
spi_clk (SROM_SPI_SCK) (15.625MHz serial clock)
•
SROM_SPI_CS_N (chip select)
•
SROM_SPI_MOSI (master out, slave in; to serial flash)
•
SROM_SPI_MISO (master in, slave out; from serial flash)
19.2.2.9 Software Command Sequences to Execute an SPI Flash Instruction
In general, there are three steps involved in executing an SPI Flash instruction listed in Table 19-2. Step
1 and/or step 3 might not be necessary for certain SPI Flash instructions.
1.
Program the SROM SPI Address (SROM_ADDRESS) register and program the SROM SPI Byte
register.
2.
Program the SROM SPI Instruction (SROM_INSTRUCTION) register to start the operation.
3.
Data handling by either writing to the SROM SPI Write Data (SROM_WRITE_DATA) register or
by reading from the SROM SPI Read Data (SROM_READ_DATA) register.
Before programming the SROM SPI Instruction (SROM_INSTRUCTION) register, software
must ensure that the SROM controller is idle. There are two usage models:
•
Interrupt
When the INST_EXEC interrupt (in the SROM_INT_VEC_W1TC register) is asserted, it means
the previous instruction is done. Software first clears the INST_EXEC interrupt bit, and then
writes to SROM SPI Instruction (SROM_INSTRUCTION) register.
•
Status Polling
Software can poll the BUSY bit of the SROM_FLAG register, and write to the SROM SPI Instruction (SROM_INSTRUCTION) register when BUSY=0.
Several instructions, including WRSR and PP, involve one or more bytes of write data. All
write data is supplied by writing the SROM SPI Write Data (SROM_WRITE_DATA) register.
Each write pushes one entry onto the write FIFO. Before writing to the SROM SPI Write
Data register, software must make sure that the write FIFO is not full.
There are two usage models:
• Interrupt
When the WFIFO_READ interrupt (in the SROM_INT_VEC_W1TC register) is asserted, the
previous write data has been processed. Software first clears the WFIFO_READ interrupt bit.
Software then writes to the SROM SPI Write Data (SROM_WRITE_DATA) register, based on the
status of the write FIFO (SROM_FLAG register). The write FIFO is full if WFIFO_FULL=1, holds
one data entry if WFIFO_FULL=0 and WFIFO_EMPTY=0, and is empty if WFIFO_EMPTY=1.
• Status Polling
Software can poll the SROM_FLAG register, and write to the SROM SPI Write Data
(SROM_WRITE_DATA) register when the write FIFO is not full.
Several instructions, including RDSR, RDID0/RDID1, READ, and RES, involve one or more
bytes of read data. All read data is returned by reading the SROM SPI Read Data
(SROM_READ_DATA) register. Each read returns (pops) one entry from the read FIFO. Before
reading from the SROM SPI Read Data (SROM_READ_DATA) register, software must ensure that
the read FIFO is not empty.
There are two usage models:
• Interrupt
When the RFIFO_WRITE interrupt is asserted, new read data is available. Software
first clears the RFIFO_WRITE interrupt bit. Software then reads the SROM SPI Read Data
register based on the status of the read FIFO (SROM_FLAG register).
• Status Polling
Software can poll the SROM_FLAG register and read from the SROM SPI Read Data
(SROM_READ_DATA) register when the read FIFO is not empty.
Table 19-1 shows how each SPI Flash instruction is implemented by a sequence of
software reads and writes. For simplicity, the detailed steps for writing the SROM SPI
Instruction (SROM_INSTRUCTION) register are abbreviated as Write(SROM_INSTRUCTION). The
detailed steps for writing the SROM SPI Write Data (SROM_WRITE_DATA) register are
abbreviated as Write(SROM_WRITE_DATA). The detailed steps for reading the SROM SPI Read
Data (SROM_READ_DATA) register are abbreviated as Read(SROM_READ_DATA).
Table 19-1. SPI Flash Implementation Instructions
Instruction
Software Command Sequence
RDSR
• Write(SROM_INSTRUCTION) = 05h
• Read(SROM_READ_DATA) returns status register once.
Note: Status returns on SROM_READ_DATA[7:0].
WRSR
• Poll SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Make sure write is enabled (for example, execute the WREN instruction).
• Write(SROM_INSTRUCTION) = 01h
• Write(SROM_WRITE_DATA) = desired status setting.
Note: SROM_WRITE_DATA[7:0] contains the status setting.
WREN
• Poll SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Write(SROM_INSTRUCTION) = 06h
WRDI
• Poll SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Write(SROM_INSTRUCTION) = 04h
RDID0/RDID1
• Poll SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Write(SROM_BYTE) = 1/2/3
• Write(SROM_INSTRUCTION) = 9fh or 15h
• Read(SROM_READ_DATA) returns RDID.
Note: Typically RDID has one byte of manufacturer identification (stored in SROM_READ_DATA[23:16]) and two bytes of device identification, for example memory type and memory density (stored in SROM_READ_DATA[15:0]).
READ
• Poll SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Write(SROM_ADDRESS) = starting byte address to read
• Write(SROM_BYTE) = desired transfer size (must be in 8B quantities).
• Write(SROM_INSTRUCTION) = 03h
• Perform multiple Read(SROM_READ_DATA) according to the SROM_BYTE. One
Read(SROM_READ_DATA) returns eight bytes of read data.
Note: SROM_READ_DATA[7:0] contains data for the lowest address, SROM_READ_DATA[63:56] contains data for the highest address.
PP
• Poll SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Make sure write is enabled and the targeted area is not protected (for example execute the
WREN instruction or WRSR instruction)
• Write(SROM_ADDRESS) = starting byte address to program
• Write(SROM_BYTE) = desired transfer size (must be in 8B quantities and should not cross
the page size boundary specified in the SROM hardware specification).
• Write(SROM_INSTRUCTION) = 02h
• Perform multiple Write(SROM_WRITE_DATA) according to the SROM_BYTE. One
Write(SROM_WRITE_DATA) contains eight bytes of desired write value.
Note: SROM_WRITE_DATA[7:0] contains data for the lowest address, SROM_WRITE_DATA[63:56] contains data for the highest address.
SE0/SE1
• Poll SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Make sure write is enabled and the targeted area is not protected (for example execute the
WREN instruction or WRSR instruction)
• Write(SROM_ADDRESS) = any byte address within the sector to be erased.
• Write(SROM_INSTRUCTION) = d8h or 52h.
BE0/BE1
• Poll SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Make sure write is enabled and the targeted area is not protected (for example execute the
WREN instruction or WRSR instruction)
• Write(SROM_INSTRUCTION) = c7h or 62h
DP
• Poll SROM status register (via an RDSR instruction) until the WIP bit is clear.
• Write(SROM_INSTRUCTION) = b9h
RES
• Write(SROM_BYTE) = 0/3, 3 for read electronic signature
• Write(SROM_INSTRUCTION) = abh
• Read(SROM_READ_DATA) returns electronic signature, if desired.
Note: SROM_READ_DATA[7:0] contains the electronic signature.
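The byte ordering given in the READ and PP notes (bits [7:0] hold the lowest address, bits [63:56] the highest) can be sketched in C as follows. The function names srom_unpack_word and srom_pack_word are hypothetical helpers, not part of any Tilera API.

```c
#include <stdint.h>

/* Split one SROM_READ_DATA word into bytes in address order:
 * out[0] is the byte at the lowest address (bits [7:0]). */
void srom_unpack_word(uint64_t word, uint8_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(word >> (8 * i));
}

/* Build one SROM_WRITE_DATA word from eight bytes in address order:
 * in[0] lands in bits [7:0], in[7] in bits [63:56]. */
uint64_t srom_pack_word(const uint8_t in[8])
{
    uint64_t word = 0;
    for (int i = 0; i < 8; i++)
        word |= (uint64_t)in[i] << (8 * i);
    return word;
}
```

Shifting explicitly, instead of copying the word with memcpy, keeps the result independent of the host byte order.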
19.2.2.10 Interface Timing
Table 19-2 lists the timing arcs that are implemented by hardware (assuming a 125 MHz
reference clock). Program and erase operations can take a long time; software uses the WIP
bit to check whether the operation has finished.
Table 19-2. Timing Arcs Implemented by Hardware
Symbol   Timing Parameter                                                     Hardware Implemented Value
tSHSL    Chip deselect time                                                   240 ns
tDP      Chip high to Deep Power-down mode                                    6.4 us
tRES1    Chip high to standby power mode without Electronic signature read    64 us
tRES2    Chip high to standby power mode with Electronic signature read       64 us
tSLCH    Chip select active setup time                                        44 ns
tCHSL    Chip select not active hold time                                     44 ns
tDVCH    Master out and slave in setup time                                   12 ns
tCHDX    Master out and slave in hold time                                    12 ns
tCHSH    Chip select active hold time                                         44 ns
tSHCH    Chip select not active setup time                                    44 ns
Write protection is supported via the status registers. External SROM devices can support write
protection and hold pins, which are supported by board functions.
19.2.3 Rshim Interface
19.2.3.1 Rshim Register Interface
The SROM controller is accessed through a register interface. A write is acknowledged once
the previous write has been consumed. A read is acknowledged once read data is returned from
the external SROM. If there is no read data to return, the read is acknowledged immediately
(with 0s), and an interrupt is generated.
19.2.3.2 Rshim Host Interface
The host interface is used during the boot sequence. Once the boot code is read from the
external SROM, it is first stored in a small boot FIFO. The host interface assembles the
channel number and register number together with the boot code. A simple busy and
acknowledge protocol is applied between this interface and Rshim on the other side.
19.2.3.3 Error Handling and Interrupts
It is up to software to maintain the correct boot ROM format, that is, the correct encoding to
specify the length of the boot code. It is also up to software to follow the correct protocol
when executing SPI-based serial flash instructions: program SROM_ADDRESS and SROM_BYTE
before programming SROM_INSTRUCTION, and read and program the SROM in 8B quantities.
Finally, it is up to software to issue recognizable instructions.
Error conditions are logged; refer to the SROM interrupt registers for how errors are logged.
When an error condition is encountered, an interrupt is sent to the bound tile. The binding is
defined by the Rshim interrupt binding (RSH_INT_BIND) register.
APPENDIX A: JTAG INTERFACE
The TILE-Gx processor supports standard boundary scan and is IEEE 1149.1-compliant. A BSDL
file is available. In addition to the standard 1149.1 registers, the TILE-Gx JTAG interface
provides instructions to read and write a number of internal data registers distributed
throughout the processor.
Access to the JTAG controller is shared between the JTAG I/O pins and an internal JTAG
controller. To enable access from the JTAG I/O pins, the RJC pin must be deasserted. For more
information, refer to TILE-Gx36 Data Sheet (DS400).
Most of the JTAG registers are for test purposes only; however, the Rshim data register
supports read and write transactions to the Rshim MMIO registers. The JTAG instruction for
Rshim access is 27 bits long and has the value 0x02C009A. Writing the Rshim access data
register can initiate a read or write operation to the Rshim MMIO registers. Reading the Rshim
access data register supplies status and data from the previous read or write operation. Data
written into the Rshim data register has the following format:
Table A-1. Write Data Format
Bits     Function   Description
1:0      CMD        0 = Nop. 1 = Read. 2 = Write. 3 = Reserved.
65:2     DATA       Data to be written if write operation.
78:66    REG        Offset of register to be accessed.
82:79    CHAN       Channel to be accessed.
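The write data format above can be sketched in C as a packing helper. This is a minimal sketch: the 83-bit value is held in a hypothetical struct rshim_jtag_word, and the names rshim_jtag_pack and rshim_jtag_cmd are illustrative only. How the bits are then shifted through the JTAG port is outside the scope of the sketch.

```c
#include <stdint.h>

/* CMD encodings from Table A-1 (value 3 is reserved). */
enum rshim_jtag_cmd { RSHIM_NOP = 0, RSHIM_READ = 1, RSHIM_WRITE = 2 };

/* Hypothetical container for the 83-bit write value. */
struct rshim_jtag_word {
    uint64_t lo;  /* bits 63:0 */
    uint32_t hi;  /* bits 82:64, stored in hi[18:0] */
};

/* Pack CMD [1:0], DATA [65:2], REG [78:66], and CHAN [82:79]. */
struct rshim_jtag_word rshim_jtag_pack(unsigned cmd, uint64_t data,
                                       unsigned reg, unsigned chan)
{
    struct rshim_jtag_word w;
    w.lo = (uint64_t)(cmd & 0x3u) | (data << 2);   /* DATA[61:0] -> bits 63:2 */
    w.hi = (uint32_t)(data >> 62)                   /* DATA[63:62] -> bits 65:64 */
         | ((uint32_t)(reg & 0x1FFFu) << 2)         /* REG -> bits 78:66 */
         | ((uint32_t)(chan & 0xFu) << 15);         /* CHAN -> bits 82:79 */
    return w;
}
```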
The data read from the Rshim data register has the following format:
Table A-2. Read Data Format
Bits     Function   Description
1:0      STATUS     0 = Read/Write complete. 3 = Read/Write incomplete.
65:2     DATA       Register data if the operation was a read and has completed.
85:66    PAD        Pad data.
The STATUS field of the data register can be used to determine whether the Rshim access
function is idle or an operation is in progress.
Read operations are accomplished by shifting in the READ command into the Rshim data register
and updating it. Then the Rshim data register can capture the read data, and it can be shifted out.
In the unlikely event that the read transaction had not completed by the time the data was captured, the status will be reported as 3, and another read can be attempted.
Write operations can poll the STATUS field to ensure that previous operations are complete before
starting the next operation.
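The status check described above can be sketched in C as a decoder for the captured read value. The function name rshim_jtag_read_done and the split of the captured value into lo and hi words are assumptions made for this sketch. Any STATUS value other than 0 is treated as incomplete here, although Table A-2 defines only 0 and 3.

```c
#include <stdbool.h>
#include <stdint.h>

#define RSHIM_STATUS_COMPLETE   0u
#define RSHIM_STATUS_INCOMPLETE 3u

/*
 * Decode a captured read value: lo holds bits 63:0, hi holds bits 85:64.
 * Returns true and stores DATA [65:2] in *data when STATUS reports
 * completion; returns false when the capture should be retried.
 */
bool rshim_jtag_read_done(uint64_t lo, uint32_t hi, uint64_t *data)
{
    unsigned status = (unsigned)(lo & 0x3u);
    if (status != RSHIM_STATUS_COMPLETE)
        return false;
    *data = (lo >> 2) | ((uint64_t)(hi & 0x3u) << 62);
    return true;
}
```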
APPENDIX B: CLASSIFIER INSTRUCTIONS AND SPRS
[Figure: instruction bit layout, bits 31 to 0. Formats shown: Register Op, Immediate Op, mfspr/mtspr, Branch, Jump. Fields shown: opcode, destReg, srcA, srcB, immediate, spr_idx, target, Reserved.]
Figure B-1: Classifier Instruction Format
Table B-1. destReg, srcA, and srcB Encodings
Register Number
Name/Description
As Destination
As Source (a)
21:0
GPRs. General Purpose Registers.
22
hPtr
23
tbl[tPtr++(1)] (mem1)
pdPtr
24
tbl[tPtr++(2)] (mem2)
tPtr
25
iHdr[hPtr] (peek2)
26
iHdr[hPtr++(2)] (get2)
27
Hash0_lo
28
Hash0_hi
29
Hash1_lo
30
Hash1_hi
31
NULL
pDesc[pdPtr++] (put2, inc1)
pDesc[pdPtr++(2)] (put2, inc2)
(a) When only one description is provided, the meaning is the same for Source and Destination.
B.1 Classifier Instructions
The classifier instructions, listed below, are defined in the following sections:
• Arithmetic Instructions
• Comparison Instructions
• Control Instructions
• Logical Instructions
• Miscellaneous Instructions
B.1.1 Arithmetic Instructions
Table B-2. Arithmetic Instructions
Instruction
Example
Opcode
Function
Description
ADD
add r3,r1,r2
62
rf[Dest] = rf[SrcA] + rf[SrcB];
Add two words together.
ADDI
addi
r3,r1,-5
63
rf[Dest] = rf[SrcA] + Imm16;
Add the contents of a
register and an immediate.
CSUM
csum
r3,r1,r2
61
rf[Dest] = csum(rf[SrcA],
rf[SrcB]);
Compute the checksum
of two words.
CSUM_HASH0_ACC2
csum_hash0
_acc2
r3,r1,r2
48
{unsigned int tmpA = rf[SrcA],
tmpB = rf[SrcB]; Accum0 =
crc32_16(Accum0, rf[SrcB]);
rf[Dest] = csum(tmpA,tmpB);}
Compute the checksum
of two words and accumulate the CRC of one
of the words.
CSUM_HASH0_SEED2
csum_hash0
_seed2
r3,r1,r2
52
{unsigned int tmpA = rf[SrcA],
tmpB = rf[SrcB]; Accum0 =
crc32_16(0xffffffff, rf[SrcB]);
rf[Dest] = csum(tmpA, tmpB);}
Compute the checksum
of two words and CRC of
one of the words with a
seed into the accumulator.
CSUM_HASH1_ACC2
csum_hash1
_acc2
r3,r1,r2
49
{unsigned int tmpA = rf[SrcA],
tmpB = rf[SrcB]; Accum1 =
crc32_16(Accum1, rf[SrcB]);
rf[Dest] = csum(tmpA, tmpB);}
Compute the checksum
of two words and accumulate the CRC of one
of the words.
CSUM_HASH1_SEED2
csum_hash1
_seed2
r3,r1,r2
53
{unsigned int tmpA = rf[SrcA],
tmpB = rf[SrcB]; Accum1 =
crc32_16(0xffffffff, rf[SrcB]);
rf[Dest] = csum(tmpA, tmpB);}
Compute the checksum
of two words and CRC of
one of the words with a
seed into the accumulator.
CSUM_HASH16_0
csum_hash16_
0 r3,r1,r2
48
{unsigned int tmpA = rf[SrcA],
tmpB = rf[SrcB]; Accum0 =
crc32_16(Accum0, rf[SrcB]);
rf[Dest] = csum(tmpA,tmpB);}
Compute the checksum
of two words and accumulate the CRC of one
of the words.
CSUM_HASH16_1
csum_hash16_
1 r3,r1,r2
49
{unsigned int tmpA = rf[SrcA],
tmpB = rf[SrcB]; Accum1 =
crc32_16(Accum1, rf[SrcB]);
rf[Dest] = csum(tmpA, tmpB);}
Compute the checksum
of two words and accumulate the CRC of one
of the words.
CSUM_HASH8_0
csum_hash8_0
r3,r1,r2
50
{unsigned int tmpA = rf[SrcA],
tmpB = rf[SrcB]; Accum0 =
crc32_8(Accum0, (rf[SrcB] &
0xff)); rf[Dest] = csum(tmpA,
(tmpB & 0xff));}
Compute the checksum
of two words and accumulate the CRC of one
of the words.
CSUM_HASH8_1
csum_hash8_1
r3,r1,r2
51
{unsigned int tmpA = rf[SrcA],
tmpB = rf[SrcB]; Accum1 =
crc32_8(Accum1, (rf[SrcB] &
0xff)); rf[Dest] = csum(tmpA,
(tmpB & 0xff));}
Compute the checksum
of two words and accumulate the CRC of one
of the words.
CSUM_HASHS16_0
csum_hashs16
_0 r3,r1,r2
52
{unsigned int tmpA = rf[SrcA],
tmpB = rf[SrcB]; Accum0 =
crc32_16(0xffffffff, rf[SrcB]);
rf[Dest] = csum(tmpA, tmpB);}
Compute the checksum
of two words and CRC of
one of the words with a
seed into the accumulator.
CSUM_HASHS16_1
csum_hashs16
_1 r3,r1,r2
53
{unsigned int tmpA = rf[SrcA],
tmpB = rf[SrcB]; Accum1 =
crc32_16(0xffffffff, rf[SrcB]);
rf[Dest] = csum(tmpA, tmpB);}
Compute the checksum
of two words and CRC of
one of the words with a
seed into the accumulator.
CSUM_HASHS8_0
csum_hashs8_
0 r3,r1,r2
54
{unsigned int tmpA = rf[SrcA],
tmpB = rf[SrcB]; Accum0 =
crc32_8(0xffffffff, (rf[SrcB] &
0xff)); rf[Dest] = csum(tmpA,
(tmpB & 0xff));}
Compute the checksum
of two words and CRC of
one of the words with a
seed into the accumulator.
CSUM_HASHS8_1
csum_hashs8_
1 r3,r1,r2
55
{unsigned int tmpA = rf[SrcA],
tmpB = rf[SrcB]; Accum1 =
crc32_8(0xffffffff, (rf[SrcB] &
0xff)); rf[Dest] = csum(tmpA,
(tmpB & 0xff));}
Compute the checksum
of two words and CRC of
one of the words with a
seed into the accumulator.
SHL1ADD
shl1add
r3,r1,r2
16
rf[Dest] = (rf[SrcA] << 1) +
rf[SrcB];
Shifts the first operand
left by one bit and then
adds the second source
operand.
SHL1ADDI
shl1addi
r3,r1,7
17
rf[Dest] = (rf[SrcA] << 1) +
Imm16;
Shifts the first operand
left by one bit and then
adds a 16-bit immediate.
SUB
sub r3,r1,r2
60
rf[Dest] = rf[SrcA] - rf[SrcB];
Subtract one word from
another.
B.1.2 Comparison Instructions
Table B-3. Comparison Instructions
Instruction
Example
Opcode
Function
Description
CMPEQ
cmpeq
r3,r1,r2
44
rf[Dest] = ((signed
short)rf[SrcA] == (signed
short)rf[SrcB]) ? 0x0001 :
0x0000;
Set the destination register to 0x0001 if the first
source operand is equal
to the second source
operand. Otherwise, set
the destination register to
0x0000.
CMPEQI
cmpeqi
r3,r1,0x87
45
rf[Dest] = ((signed
short)rf[SrcA] == (signed
short)Imm16) ? 0x0001 : 0x0000;
Set the destination register to 0x0001 if the first
source operand is equal
to the 16-bit immediate.
Otherwise, set the destination register to 0x0000.
CMPLES
cmples
r3,r1,r2
40
rf[Dest] = ((signed
short)rf[SrcA] <= (signed
short)rf[SrcB]) ? 0x0001 :
0x0000;
Set the destination register to 0x0001 if the first
source operand is less
than or equal to the second source operand. Otherwise, set the
destination register to
0x0000.
CMPLESI
cmplesi
r3,r1,7
42
rf[Dest] = ((signed
short)rf[SrcA] <= (signed
short)Imm16) ? 0x0001 : 0x0000;
Set the destination register to 0x0001 if the first
source operand is less
than or equal to the 16-bit
immediate. Otherwise,
set the destination register to 0x0000.
CMPLEU
cmpleu
r3,r1,r2
41
rf[Dest] = (rf[SrcA] <= rf[SrcB])
? 0x0001 : 0x0000;
Set the destination register to 0x0001 if the first
unsigned source operand is less than or equal
to the second unsigned
source operand. Otherwise, set the destination
register to 0x0000.
CMPLEUI
cmpleui
r3,r1,7
43
rf[Dest] = (rf[SrcA] <= Imm16) ?
0x0001 : 0x0000;
Set the destination register to 0x0001 if the first
unsigned source operand is less than or equal
to the unsigned 16-bit
immediate. Otherwise,
set the destination register to 0x0000.
CMPLTS
cmplts
r3,r1,r2
36
rf[Dest] = ((signed
short)rf[SrcA] < (signed
short)rf[SrcB]) ? 0x0001 :
0x0000;
Set the destination register to 0x0001 if the first
source operand is less
than the second source
operand. Otherwise, set
the destination register to
0x0000.
CMPLTU
cmpltu
r3,r1,r2
37
rf[Dest] = (rf[SrcA] < rf[SrcB])
? 0x0001 : 0x0000;
Set the destination register to 0x0001 if the first
unsigned source operand is less than the second unsigned source
operand. Otherwise, set
the destination register to
0x0000.
CMPNE
cmpne
r3,r1,r2
46
rf[Dest] = ((signed
short)rf[SrcA] != (signed
short)rf[SrcB]) ? 0x0001 :
0x0000;
Set the destination register to 0x0001 if the first
source operand is not
equal to the second
source operand. Otherwise, set the destination
register to 0x0000.
CMPNEI
cmpnei
r3,r1,0x87
47
rf[Dest] = ((signed
short)rf[SrcA] != (signed
short)Imm16) ? 0x0001 : 0x0000;
Set the destination register to 0x0001 if the first
source operand is not
equal to the 16-bit immediate. Otherwise, set the
destination register to
0x0000.
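The difference between the signed and unsigned comparisons in Table B-3 can be sketched in C by following the Function column. The function names below are hypothetical. The casts to int16_t rely on the usual two's-complement conversion, which is implementation-defined in older C standards but behaves as expected on common compilers.

```c
#include <stdint.h>

/* Signed comparisons: operands are reinterpreted as 16-bit signed values. */
uint16_t cmples16(uint16_t a, uint16_t b)
{
    return ((int16_t)a <= (int16_t)b) ? 0x0001 : 0x0000;
}

uint16_t cmplts16(uint16_t a, uint16_t b)
{
    return ((int16_t)a < (int16_t)b) ? 0x0001 : 0x0000;
}

/* Unsigned comparisons: operands are compared as raw 16-bit values. */
uint16_t cmpleu16(uint16_t a, uint16_t b)
{
    return (a <= b) ? 0x0001 : 0x0000;
}

uint16_t cmpltu16(uint16_t a, uint16_t b)
{
    return (a < b) ? 0x0001 : 0x0000;
}
```

For example, 0xFFFF compared with 0x0001 is "less than or equal" under cmples16 (it reads as -1) but not under cmpleu16 (it reads as 65535).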
B.1.3 Control Instructions
Table B-4. Control Instructions
Instruction
Example
Opcode
Function
Description
BGEZ
bgez
r1,0x1000
10
if ((signed short)rf[SrcA] >= 0)
{ setNextPC(Imm16); delay += 2;
taken = true; } else { setNextPC(pc + 1); }
Branch to the target
address if the source
operand is greater than
or equal to zero.
BGEZT
bgezt
r1,0x1000
11
if ((signed short)rf[SrcA] >= 0)
{ setNextPC(Imm16); delay += 1;
taken = true; } else { setNextPC(pc + 1); delay += 2; }
Branch to the target
address if the source
operand is greater than
or equal to zero. Provide the hardware a hint
that the branch will be
taken.
BLBC
blbc
r1,0x1000
14
if (!(rf[SrcA] & 0x1)) { setNextPC(Imm16); delay += 2; taken
= true; } else { setNextPC(pc +
1); }
Branch to the target
address if Bit 0 of the
source operand is 0.
BLBCT
blbct
r1,0x1000
15
if (!(rf[SrcA] & 0x1)) { setNextPC(Imm16); delay += 1; taken
= true; } else { setNextPC(pc +
1); delay += 2; }
Branch to the target
address if Bit 0 of the
source operand is 0.
Provide the hardware a
hint that the branch will
be taken.
BLBS
blbs
r1,0x1000
12
if (rf[SrcA] & 0x1) { setNextPC(Imm16); delay += 2; taken
= true; } else { setNextPC(pc +
1); }
Branch to the target
address if Bit 0 of the
source operand is 1.
BLBST
blbst
r1,0x1000
13
if (rf[SrcA] & 0x1) { setNextPC(Imm16); delay += 1; taken
= true; } else { setNextPC(pc +
1); delay += 2; }
Branch to the target
address if Bit 0 of the
source operand is 1.
Provide the hardware a
hint that the branch will
be taken.
BLTZ
bltz
r1,0x1000
8
if ((signed short)rf[SrcA] < 0) {
setNextPC(Imm16); delay += 2;
taken = true; } else { setNextPC(pc + 1); }
Branch to the target
address if the source
operand is less than
zero.
BLTZT
bltzt
r1,0x1000
9
if ((signed short)rf[SrcA] < 0) {
setNextPC(Imm16); delay += 1;
taken = true; } else { setNextPC(pc + 1); delay += 2; }
Branch to the target
address if the source
operand is less than
zero. Provide the hardware a hint that the
branch will be taken.
JR
jr r1
5
setNextPC(rf[SrcA]); delay += 2;
Jump to the address in
the source register.
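The delay values in the Function column of Table B-4 follow one pattern for the plain branches and another for the "T" (hinted taken) variants. The C sketch below only restates that pattern. The function name branch_delay is hypothetical, and the return value is the amount added to delay in the table.

```c
#include <stdbool.h>

/*
 * Amount added to `delay` by a conditional branch, per Table B-4:
 *   plain branch (BGEZ, BLBC, BLBS, BLTZ):      taken = 2, not taken = 0
 *   hinted branch (BGEZT, BLBCT, BLBST, BLTZT): taken = 1, not taken = 2
 */
unsigned branch_delay(bool hinted_taken, bool taken)
{
    if (hinted_taken)
        return taken ? 1u : 2u;
    return taken ? 2u : 0u;
}
```

Read this way, the hinted form is cheaper only when the branch is taken; a hinted branch that falls through costs as much as an unhinted taken branch.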
B.1.4 Logical Instructions
Table B-5. Logical Instructions
Instruction
Example
Opcode
Function
Description
AND
and r3,r1,r2
22
rf[Dest] = rf[SrcA] & rf[SrcB];
Compute the logical
AND of two words.
ANDI
andi
r3,r1,0x0f
23
rf[Dest] = rf[SrcA] & Imm16;
Compute the logical
AND of a word and a
16-bit immediate.
OR
or r3,r2,r1
24
rf[Dest] = rf[SrcA] | rf[SrcB];
Compute the logical OR
of two words.
ORI
ori
r3,r1,0x0f
25
rf[Dest] = rf[SrcA] | Imm16;
Compute the logical OR
of a word and a 16-bit
immediate.
ROTL
rotl
r3,r1,r2
34
rf[Dest] = (rf[SrcA] >> (16 - (rf[SrcB] & 0xf))) | (rf[SrcA] <<
(rf[SrcB] & 0xf));
Rotate a word left by the
number of bits specified
in the second source
operand.
ROTLI
rotli
r3,r1,7
35
rf[Dest] = ((unsigned
short)rf[SrcA] >> (16 - (Imm16 &
0xf))) | ((unsigned
short)rf[SrcA] << (Imm16 & 0xf));
Rotate a word left by the
number of bits specified
in an unsigned 16-bit
immediate.
SHL
shl r3,r1,r2
28
rf[Dest] = rf[SrcA] << (rf[SrcB]
& 0xf);
Left-shift a word by the
number of bits specified
in the second source
register.
SHLI
shli r3,r1,7
29
rf[Dest] = rf[SrcA] << (Imm16 &
0xf);
Left-shift a word by the
number of bits specified
in an unsigned 16-bit
immediate.
SHRS
shrs r3,r1,r2
32
{ unsigned int sign = rf[SrcA] &
0x8000; unsigned int tmp =
rf[SrcA]; for (int i = 0; i <
(rf[SrcB] & 0xf); i++) { tmp >>=
1; tmp |= sign; } rf[Dest] = tmp;
}
Right-shift a word by the
number of bits specified
in the second source
register. Upper bits are
filled with the value in
the most-significant bit of
the first source operand.
SHRSI
shrsi
r3,r1,7
33
{ unsigned int sign = rf[SrcA] &
0x8000; unsigned int tmp =
rf[SrcA]; for (unsigned int i =
0; i < (Imm16 & 0xf); i++) { tmp
>>= 1; tmp |= sign; } rf[Dest] =
tmp; }
Right-shift a word by the
number of bits specified
in an unsigned 16-bit
immediate. Upper bits
are filled with the value
of the most-significant bit
of the first source operand.
SHRU
shru
r3,r1,r2
30
rf[Dest] = rf[SrcA] >> (rf[SrcB]
& 0xf);
Right-shift a word by the
number of bits specified
in the second source
register. Upper bits are
filled with zeros.
SHRUI
shrui
r3,r1,7
31
rf[Dest] = rf[SrcA] >> (Imm16 &
0xf);
Right-shift a word by the
number of bits specified
in an unsigned 16-bit
immediate. Upper bits
are filled with zero.
XOR
xor r3,r1,r2
26
rf[Dest] = rf[SrcA] ^ rf[SrcB];
Compute the logical
XOR of two words.
XORI
xori
r3,r1,0x0f
27
rf[Dest] = rf[SrcA] ^ Imm16;
Compute the logical
XOR of a word and a
16-bit immediate.
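The rotate and arithmetic-shift semantics in Table B-5 can be sketched in C on 16-bit values. The names rotl16 and shrs16 are hypothetical. The sketch assumes the register file holds 16-bit words, as the 0xf shift masks and the 0x8000 sign bit in the Function column suggest. The zero-shift case in rotl16 is handled separately to avoid a 16-bit shift of a 16-bit value.

```c
#include <stdint.h>

/* Rotate a 16-bit word left by (b & 0xf) bits, as in ROTL. */
uint16_t rotl16(uint16_t a, uint16_t b)
{
    unsigned n = b & 0xFu;
    if (n == 0)
        return a;
    return (uint16_t)(((uint32_t)a << n) | ((uint32_t)a >> (16u - n)));
}

/* Arithmetic right shift by (b & 0xf) bits, as in SHRS: after each one-bit
 * shift, the original sign bit is copied back into bit 15. */
uint16_t shrs16(uint16_t a, uint16_t b)
{
    uint16_t sign = a & 0x8000u;
    uint16_t tmp = a;
    for (unsigned i = 0; i < (b & 0xFu); i++) {
        tmp >>= 1;
        tmp |= sign;
    }
    return tmp;
}
```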
B.1.5 Miscellaneous Instructions
Table B-6. Miscellaneous Instructions
Instruction
Example
Opcode
Function
Description
CTZ
ctz r3,r1
21
{ unsigned int counter;
for(counter = 0; counter < 16;
counter++){ if( (rf[SrcA] >>
counter ) & 0x1){ break; } }
rf[Dest] = counter; }
Returns the number of trailing zeros in a word
before a bit is set (1).
This instruction scans
the input word from the
least significant bit to the
most significant bit. The
result of this operation
can range from 0 to 16.
MFSPR
mfspr r6,0x5
3
rf[Dest] = sprf[SprIdx];
Move a word from the
special purpose register
indexed by an immediate.
MTSPR
mtspr 0x5,r1
2
sprf[SprIdx] = rf[SrcA];
Move a word to the special purpose register
indexed by an immediate.
REDMG4
redmg4
r3,r2,r1
59
rf[Dest] =
(crc8_16(0xff,rf[SrcA]) & 0xf) |
(rf[SrcB] & 0xfff0);
Performs an 8-bit CRC
reduction on a 16 bit
input (keeping the low 4
bits) and merges with
the upper 12-bits of the
other input.
REDMG6
redmg6
r3,r2,r1
58
rf[Dest] =
(crc8_16(0xff,rf[SrcA]) & 0x3f) |
(rf[SrcB] & 0xffc0);
Performs an 8-bit CRC
reduction on a 16 bit
input (keeping the low 6
bits) and merges with
the upper 10-bits of the
other input.
REDMG8
redmg8
r3,r2,r1
57
rf[Dest] =
(crc8_16(0xff,rf[SrcA]) & 0xff) |
(rf[SrcB] & 0xff00);
Performs an 8-bit CRC
reduction on a 16 bit
input (keeping the low 8
bits) and merges with
the upper 8-bits of the
other input.
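The CTZ Function entry in Table B-6 translates almost directly into C. The sketch below uses the hypothetical name ctz16 and assumes a 16-bit input word. It shows why the result ranges from 0 to 16: an all-zero input runs the loop to completion.

```c
#include <stdint.h>

/* Count trailing zeros of a 16-bit word, scanning from bit 0 upward.
 * Returns 16 when no bit is set. */
uint16_t ctz16(uint16_t a)
{
    unsigned counter;
    for (counter = 0; counter < 16; counter++) {
        if ((a >> counter) & 0x1u)
            break;
    }
    return (uint16_t)counter;
}
```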
B.2 Registers
B.2.1 Register Summary
Table B-7. Register Summary
Address  Register
0000     CLASSIFIER HEADER FLAGS (CLASSIFIER_HEADER_FLAGS: 0X0000)
0001     CLASSIFIER CHANNEL (CLASSIFIER_CHANNEL: 0X0001)
0002     CLASSIFIER CURRENT PACKET SIZE (CLASSIFIER_L2_SIZE: 0X0002)
0003     CLASSIFIER PASS (CLASSIFIER_PASS: 0X0003)
0004     CLASSIFIER BUDGET (CLASSIFIER_BUDGET: 0X0004)
0006     CLASSIFIER CONTROL (CLASSIFIER_CTL: 0X0006)
0007     CLASSIFIER RAND (CLASSIFIER_RAND: 0X0007)
B.2.2 Register Definitions
CLASSIFIER HEADER FLAGS (CLASSIFIER_HEADER_FLAGS: 0X0000)
This register contains information about the iHdr.
Figure B-2: CLASSIFIER_HEADER_FLAGS Register (fields: CT, HVLD, ME, TRUNC, CERR, PKT_ERR; remaining bits reserved)
Table B-8. CLASSIFIER_HEADER_FLAGS Register Bit Descriptions
Bits
Name
Type
Reset
Description
15
CT
RO
0
If asserted, packet was cut-through. This bit will be copied into the
descriptor.
5:4
HVLD
RO
0
Valid Headers. Provides a count of the number of valid headers
buffered for the classifier. Since the classifier stalls on MFSPR to
this SPR if the header is not valid, the only values the program will
ever see are 1 or 2.
3
ME
RO
0
MAC Error. If asserted, packet from MAC was terminated with a
MAC-error indicator. This bit will be copied into the descriptor. Note
that this bit is not valid on packets that are cutting through.
2
TRUNC
RO
0
If asserted, packet was truncated due to insufficient storage in
packet buffer. This bit will be copied into the descriptor. Note that
this bit is not valid on packets that are cutting through.
1
CERR
RO
0
L2 CRC Error. If asserted, packet had an L2 CRC error. This bit will
be copied into the descriptor. Note that this bit is not valid on packets that are cutting through.
0
PKT_ERR
RO
0
Packet Error. This field indicates that the packet had a packet error.
It is equivalent to (CERR || TRUNC || ME).
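The bit assignments in Table B-8 can be sketched in C as a decoder for a value read from this SPR. The struct and function names are hypothetical. The consistency helper only restates the documented relation PKT_ERR = (CERR || TRUNC || ME).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of CLASSIFIER_HEADER_FLAGS. */
struct cls_hdr_flags {
    bool ct;        /* bit 15: packet was cut-through */
    unsigned hvld;  /* bits 5:4: count of valid headers */
    bool me;        /* bit 3: MAC error */
    bool trunc;     /* bit 2: packet truncated */
    bool cerr;      /* bit 1: L2 CRC error */
    bool pkt_err;   /* bit 0: packet error summary */
};

struct cls_hdr_flags cls_decode_header_flags(uint16_t v)
{
    struct cls_hdr_flags f;
    f.ct      = ((v >> 15) & 1u) != 0;
    f.hvld    = (v >> 4) & 3u;
    f.me      = ((v >> 3) & 1u) != 0;
    f.trunc   = ((v >> 2) & 1u) != 0;
    f.cerr    = ((v >> 1) & 1u) != 0;
    f.pkt_err = (v & 1u) != 0;
    return f;
}

/* True when PKT_ERR matches (CERR || TRUNC || ME). */
bool cls_pkt_err_consistent(struct cls_hdr_flags f)
{
    return f.pkt_err == (f.cerr || f.trunc || f.me);
}
```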
CLASSIFIER CHANNEL (CLASSIFIER_CHANNEL: 0X0001)
This register indicates the source channel for the original packet.
Figure B-3: CLASSIFIER_CHANNEL Register (field: CHANNEL; remaining bits reserved)
Table B-9. CLASSIFIER_CHANNEL Register Bit Descriptions
Bits
Name
Type
Reset
Description
4:0
CHANNEL
RO
0
Channel for the original packet.
CLASSIFIER CURRENT PACKET SIZE (CLASSIFIER_L2_SIZE: 0X0002)
This register indicates the number of bytes in the current packet not including SFD, the preamble,
or CRC. If the CT bit of the CLASSIFIER_HEADER_FLAGS register is asserted, this register indicates the number of bytes received before classification.
Figure B-4: CLASSIFIER_L2_SIZE Register (field: L2_SIZE; remaining bits reserved)
Table B-10. CLASSIFIER_L2_SIZE Register Bit Descriptions
Bits
Name
Type
Reset
Description
13:0
L2_SIZE
RO
0
Number of bytes in L2, or number of bytes received before classification.
CLASSIFIER PASS (CLASSIFIER_PASS: 0X0003)
This register is used in an implementation-specific way for communication between the simulator
and tile software. It shares the same 16-bit storage area as CLASSIFIER_FAIL and CLASSIFIER_DONE. This data is also visible to Tile software via I/O configuration and can be used to
exchange data between the Classifier and Tile software.
Figure B-5: CLASSIFIER_PASS Register (field: DATA)
Table B-11. CLASSIFIER_PASS Register Bit Descriptions
Bits
Name
Type
Reset
Description
15:0
DATA
RW
0
Implementation-specific data.
CLASSIFIER BUDGET (CLASSIFIER_BUDGET: 0X0004)
This register provides the current cycle budget information.
Figure B-6: CLASSIFIER_BUDGET Register (fields: BELOW_HWM, CNT; remaining bits reserved)
Table B-12. CLASSIFIER_BUDGET Register Bit Descriptions
Bits
Name
Type
Reset
Description
15
BELOW_HWM
RO
0
When asserted, the classifier header queue is below the high
water mark and the cycle budget will be ignored. When clear,
the cycle budget counter reaching zero will cause the current
packet processing to be terminated.
10:0
CNT
RO
0
Current budget cycle count. When this count reaches zero
and the BELOW_HWM bit is clear, the processing of the current packet will be terminated and the default DEST, NOTIFRING, and STACK fields from the CLS_CTL register will
be applied.
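The termination condition described in Table B-12 can be sketched in C as a predicate on the register value. The function name cls_budget_expired is hypothetical, and the sketch only restates the rule from the table: processing is terminated when CNT is zero and BELOW_HWM is clear.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Evaluate a CLASSIFIER_BUDGET value: BELOW_HWM is bit 15, CNT is bits 10:0.
 * Returns true when the cycle budget would terminate packet processing.
 */
bool cls_budget_expired(uint16_t v)
{
    bool below_hwm = ((v >> 15) & 1u) != 0;
    unsigned cnt = v & 0x7FFu;
    return cnt == 0 && !below_hwm;
}
```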
CLASSIFIER CONTROL (CLASSIFIER_CTL: 0X0006)
This register contains control bits for the classifier.
Figure B-7: CLASSIFIER_CTL Register (fields: TINT, FRZ; remaining bits reserved)
Table B-13. CLASSIFIER_CTL Register Bit Descriptions
Bits
Name
Type
Reset
Description
1
TINT
WO
0
When written with a 1, send an interrupt to the tile. Always reads
zero.
0
FRZ
WO
0
When written with a 1, freeze the classifier. Always reads zero.
CLASSIFIER RAND (CLASSIFIER_RAND: 0X0007)
This register contains a pseudo-random value.
[Register layout diagram: VAL field in bits 15:0]
Figure B-8: CLASSIFIER_RAND Register
Table B-14. CLASSIFIER_RAND Register Bit Descriptions
Bits | Name | Type | Reset | Description
15:0 | VAL | RW | 0 | Value. This field provides a pseudo-random number based on an LFSR. It advances on each read when the RAND_MODE bit of the CLS_CTL configuration register is zero, and is free running when the RAND_MODE bit of the CLS_CTL register is one. If written with 0, VAL will never advance.
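The note that VAL never advances once written with 0 is a general property of XOR-feedback LFSRs: the all-zero state feeds back zero forever. The sketch below illustrates this with a textbook 16-bit LFSR. The tap positions (16, 14, 13, 11) are an assumption chosen for illustration; this guide does not state the polynomial used by the classifier hardware.

```python
def lfsr16_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR by one step.

    Taps 16, 14, 13, 11 are an illustrative choice, not the
    documented CLASSIFIER_RAND polynomial.
    """
    feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return ((state >> 1) | (feedback << 15)) & 0xFFFF


def lfsr16_period(seed: int) -> int:
    """Count the steps until the state returns to the seed."""
    state, steps = lfsr16_step(seed), 1
    while state != seed:
        state, steps = lfsr16_step(state), steps + 1
    return steps


stuck = lfsr16_step(0)            # the zero state maps to itself
period = lfsr16_period(0xACE1)    # any non-zero seed cycles through all non-zero states
```

Software that seeds VAL should therefore avoid writing zero unless a constant value is intended.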
APPENDIX C: MISCELLANEOUS ACCELERATOR SPECIFICATIONS
This appendix provides additional information about the four types of accelerators included with
the TILE-Gx family of processors. These are:
• C.1 SNOW-3G Engines
• C.2 KASUMI Engines
• C.3 Packet Processor — Programming
• C.5 Public Key Accelerator (PKA)
C.1 SNOW-3G Engines
C.1.1 Specification Summary
The SNOW-3G Engine supports the following features:
• Fully supports SNOW 3G
• Supported key size: 128-bit
• Key scheduling hardware
• Supported modes: UEA2, UIA2, 128-EEA1 and 128-EIA1
Note: These four modes are all the available encryption and integrity algorithms
defined for SNOW 3G within 3GPP.
• Fully synchronous design
Note: The SNOW 2.0 algorithm as defined in [SNOW2] is similar, but not identical, to the 3GPP
SNOW 3G algorithm as defined in [SNOW-3G]. It is therefore not possible to perform a
SNOW 2.0 [SNOW2] operation with this SNOW 3G core.
[Block diagram: SNOW Engine with Key Stream Generation, Engine Control and Feedback Modes blocks. Signals: iv [127:0], key [127:0], mode [1:0], length [15:0], data [31:0] in, data [31:0] out.]
Figure C-1: SNOW-3G Block Diagram
C.1.2 Performance
C.1.2.1 Introduction
The following sections specify the performance of the SNOW-3G engine. All numbers in this chapter
assume that the engine is kept fully utilized; that is, the host supplies input blocks and
retrieves output blocks in such a way that the engine never needs to wait for input and the
previous result has been retrieved before the next output becomes ready.
In the first table of each section, the “cycles per block”, “bits/cycle” and “throughput at maximum
frequency” numbers do not apply to the first block after selection of a new key and/or
mode of operation.
The second table in each section gives the extra clock cycles required for changing a context (key
and/or mode). For each new key and/or IV, the LFSR of the key generation module requires 32
cycles to start up. Note that for the authentication modes (UIA2 and 128-EIA1), five basic 32-bit
SNOW operations are required per authentication operation, of which two need to be executed
beforehand. The other three are performed during the authentication of the message. The authentication operation itself is based on a 64-bit polynomial multiplication.
Performance of the SNOW Engine
Table C-83 lists the performance of the SNOW Engine for all supported modes of operation.
Table C-83. High-Speed Performance
Key Size | Direction | Mode | Input/Output Block Size | Cycles per Block | Throughput (Bits/Cycle) | Throughput at SNOW Typical Clock Frequency of 600 MHz (Mbits/sec)
128 | Encryption / Decryption | UEA2 / 128-EEA1 | 32 | 1 | 32 | 19200
128 | Encryption / Decryption | UIA2 / 128-EIA1 | 32 | 1 | 32 | 19200
SNOW requires initialization of the LFSR and state. Therefore, 32 rounds are required for initialization after a key, IV and/or mode switch. The number of initialization cycles per mode is
shown in Table C-84.
Table C-84. High-Speed Context Switch Overhead
Overhead per Mode per New Context (Direction Independent) | Extra Cycles Needed per New Context (Key / IV)
mode is basic | 32 + 3 = 35
mode is UEA2 / 128-EEA1 | 32 + 3 = 35
mode is UIA2 / 128-EIA1 | 32 + 1 + 2*1 + 2 + 2 + 3 = 41
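The effect of the context switch overhead on short messages can be estimated with a small calculation. The sketch below combines the 1 cycle per 32-bit block and 600 MHz figures from Table C-83 with the 35-cycle overhead listed for the UEA2 / 128-EEA1 mode. The function name and the assumption that the overhead is paid once per message are illustrative only.

```python
def snow_effective_mbps(n_blocks: int,
                        overhead_cycles: int = 35,
                        clock_mhz: float = 600.0,
                        bits_per_block: int = 32,
                        cycles_per_block: int = 1) -> float:
    """Estimate throughput in Mbits/sec for one message of n_blocks blocks.

    Assumes a single context switch per message and a fully utilized engine.
    """
    total_cycles = overhead_cycles + n_blocks * cycles_per_block
    return n_blocks * bits_per_block * clock_mhz / total_cycles


peak = snow_effective_mbps(1, overhead_cycles=0)   # steady-state rate
short_message = snow_effective_mbps(35)            # overhead equals processing time
```

Under these assumptions a 35-block message reaches half of the steady-state rate, because the start-up cycles equal the processing cycles.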
C.1.3 Functional Description
C.1.3.1 SNOW Key Stream Generator
Introduction
The SNOW Key stream generator implements the SNOW algorithm as specified in “[SNOW-3G]”
on page 578. The core operates on the input IV and key and performs the required Feedback Shift,
substitution, multiply and XOR operations. Each round results in a 32-bit key that is used to
encrypt the data.
Inherently, considerable parallelism is possible with the SNOW algorithm. This is described in
the sections that follow.
Sub-Modules
FSM
The main SNOW FSM in the key generation module consists of three state registers, two sets of S-boxes to apply transforms S1 and S2 [SNOW-3G], two 32-bit adders and two 32-bit XOR operations.
At the start of the key initialization, the state registers are in reset and the state builds up during
the initialization phase of the LFSR.
LFSR
The largest component of the key generation module is the LFSR. It contains 16 32-bit SHIFT registers, connected sequentially and a feedback loop consisting of three XOR operations, alpha
multiplication and alpha division. The latter two components are defined as two recursive functions, explained in detail in the next paragraph.
α and α⁻¹
Two large components of the feedback loop of the LFSR are the alpha multiplication (MULα),
which is a recursive multiplication in the GF(2^8) domain, and an inverse operation in the GF(2^8)
domain defined as alpha division (DIVα). These two functions have an 8-bit input and return a 32-bit result. Since these functions are recursive up to 256 levels deep, implementing the recursion
directly is not efficient with respect to performance. To map these functions efficiently to hardware,
α and α⁻¹ are implemented as 8x32-bit look-up tables.
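As an illustration of the look-up table approach, the sketch below precomputes both tables in software. The exponents (23, 245, 48, 239 for MULα; 16, 39, 6, 64 for DIVα) and the reduction constant 0xA9 are taken from our reading of the public SNOW 3G specification and should be checked against [SNOW-3G]. The helper names are hypothetical and do not describe the hardware implementation.

```python
def mulx(v: int, c: int) -> int:
    """Multiply the byte v by x in GF(2^8), reducing with constant c."""
    shifted = (v << 1) & 0xFF
    return shifted ^ c if v & 0x80 else shifted


def mulx_pow(v: int, i: int, c: int) -> int:
    """Apply mulx i times (iterative form of the recursive definition)."""
    for _ in range(i):
        v = mulx(v, c)
    return v


def _pack(c: int, exponents) -> int:
    """Concatenate four GF(2^8) results into one 32-bit word, MSB first."""
    word = 0
    for e in exponents:
        word = (word << 8) | mulx_pow(c, e, 0xA9)
    return word


# One 32-bit entry for each possible 8-bit input.
MUL_ALPHA = [_pack(c, (23, 245, 48, 239)) for c in range(256)]
DIV_ALPHA = [_pack(c, (16, 39, 6, 64)) for c in range(256)]
```

Each table holds 256 entries of 32 bits, so a single indexed read replaces up to 245 recursive steps per byte.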
For more information about the SNOW algorithm and the mathematical background of the
described functions, refer to the SNOW specifications [SNOW-3G].
C.1.4 Feedback Logic and XOR
The feedback logic module implements the confidentiality and integrity algorithms for SNOW as
they are defined by ETSI/SAGE for use within 3GPP. Refer to [UEA2-UIA2].
In the confidentiality modes of operation (UEA2/128-EEA1), the plaintext/ciphertext is simply
XORed with the generated key data. The result of the XOR operation is the corresponding ciphertext/plaintext.
The integrity algorithm requires additional functional and control logic. Before the authentication
of the message can start, the key stream generator must be called twice to produce a 64-bit key for
the basic integrity EVAL_M function (refer to [UEA2-UIA2]), mainly consisting of a 64-bit polynomial multiplier.
The implementation of this multiplier is configuration dependent. In the High Speed configurations, a 1-cycle version of this multiplier is implemented. In the Medium Speed configuration, a 5-cycle version is used. The multiplier uses a fixed polynomial as defined by the SNOW algorithm
(x^64 + x^4 + x^3 + x + 1).
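A bit-serial software model of a 64-bit polynomial multiplication modulo x^64 + x^4 + x^3 + x + 1 is sketched below. It follows the shift-and-reduce structure commonly used for this kind of multiplier and is meant only to make the arithmetic concrete. The function names are hypothetical, and the hardware performs the same operation in 1 or 5 cycles rather than 64 iterations.

```python
MASK64 = (1 << 64) - 1
REDUCTION = 0x1B  # low-order terms x^4 + x^3 + x + 1 of the fixed polynomial


def mul64x(v: int) -> int:
    """Multiply v by x in GF(2^64), reducing when bit 63 shifts out."""
    shifted = (v << 1) & MASK64
    return shifted ^ REDUCTION if v >> 63 else shifted


def mul64(v: int, p: int) -> int:
    """Multiply two 64-bit polynomials modulo x^64 + x^4 + x^3 + x + 1."""
    result = 0
    for i in range(64):
        if (p >> i) & 1:
            result ^= v      # add v * x^i when bit i of p is set
        v = mul64x(v)        # advance v to v * x^(i+1)
    return result


wrapped = mul64(1 << 63, 2)  # x^63 * x = x^64, which reduces to 0x1B
```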
To calculate the final 32-bit authentication result, the SNOW key generation must generate three
additional parameters used to close the 64-bit multiplication sequence and XOR the final 32-bit
result.
C.1.5 Examples
Listing C-1. Example 1:
Snow implementers test data, Test Set 1, [UEA2-UIA2-Test]:
Key:        2BD6459F 82C5B300 952C4910 4881FF48
IV In:      EA024714 AD5C4D84 DF1F9B25 1C0BF45F
Data In:    00000000 00000000
Key Stream: ABEE9704 7AC31373
Data Out:   ABEE9704 7AC31373
SNOW Wide-bus Engine:
key_in[127:0]:  4881FF48 952C4910 82C5B300 2BD6459F
iv_in[127:0]:   1C0BF45F DF1F9B25 AD5C4D84 EA024714
mode_in[1:0]:   1
data_in[31:0]:  00000000
                00000000
data_out[31:0]: ABEE9704
                7AC31373
Listing C-2. Example 2:
Snow implementers test data, Test Set 4, [UEA2-UIA2-Test] (non-zero input data added)
Key:        0DED7263 109CF92E 3352255A 140E0F76
IV In:      6B68079A 41A7C4C9 1BEFD79F 7FDCC233
Data In:    12345678 9ABCDEF0 90021010 …… …… EBABEFAC
Key Stream: D712C05C A937C2A6 EB7EAAE3 …… …… 9C0DB3AA
Data Out:   C5269624 338B1C56 7B7CBAF3 …… …… E26B3406
SNOW Wide-Bus Engine:
key_in[127:0]:  140E0F76 3352255A 109CF92E 0DED7263
iv_in[127:0]:   7FDCC233 1BEFD79F 41A7C4C9 6B68079A
mode_in[1:0]:   0
data_in[31:0]:  12345678
                9ABCDEF0
                90021010
                …… ……
                EBABEFAC
data_out[31:0]: C5269624
                338B1C56
                7B7CBAF3
                …… ……
                E26B3406
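In the confidentiality modes each output word is the input word XORed with the corresponding key stream word, so the first three words of Example 2 can be checked with a few lines of code. This is a verification sketch only; the function name is hypothetical, and the values are copied from the listing.

```python
def xor_words(data_words, key_stream_words):
    """XOR each 32-bit data word with the matching key stream word."""
    return [d ^ k for d, k in zip(data_words, key_stream_words)]


data_in = [0x12345678, 0x9ABCDEF0, 0x90021010]
key_stream = [0xD712C05C, 0xA937C2A6, 0xEB7EAAE3]

data_out = xor_words(data_in, key_stream)
recovered = xor_words(data_out, key_stream)  # decryption is the same operation
```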
C.1.6 Operations
C.1.6.1 General Operations
This section describes the control and programming sequences of the SNOW-3G from a user perspective. It focuses on the typical/practical cases for regular, medium and maximum sized key
data blocks.
For a regular operation, mode, key, IV and data must be provided to the SNOW-3G. The mode,
key and IV must be provided together or before the first key data block. For authentication modes
(UIA2 / 128-EIA1), the length must also be included together with the mode, key and IV or before
the first key data block.
C.1.6.2 Encryption Modes: UEA2 / 128-EEA1
For a default encryption/decryption operation, the following mode must be programmed in the
SNOW Mode register: 3’b011 / 3’b010.
• bit [0]: selects encryption/decryption (for this mode, this bit has no effect and can be set to any
value)
• bit [2:1]: selects UEA2/128-EEA1
In addition, the host must provide the following parameters:
• key, used to seed the LFSR
• iv, used to seed the LFSR
• data, provided in 32-bit blocks via the data input bus
The number of 32-bit ciphertext blocks must match the number of provided 32-bit input data
blocks.
Note: SNOW is a stream cipher and for this reason does not require padding. However, the
SNOW-3G is 32-bit block oriented, which means that data must be submitted as 32-bit
multiples. Therefore, if a message is a non-multiple of 32 bits, the last input block needs to
be filled up with a value, for example zeroes, to complete the 32-bit block. The amount of
remaining (invalid) bits from the last 32-bit result data block must be removed by the host
or external system, after the encryption/decryption operation.
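The host-side padding and trimming described in the note can be sketched as follows. The big-endian packing of bytes into 32-bit words and the function names are assumptions made for illustration; the guide defines byte order in its conventions section.

```python
def pad_to_words(message: bytes, bit_length: int):
    """Zero-fill a message to a whole number of 32-bit words."""
    n_words = (bit_length + 31) // 32
    used_bytes = (bit_length + 7) // 8
    padded = message[:used_bytes].ljust(n_words * 4, b"\x00")
    # Big-endian word packing is an assumption of this sketch.
    return [int.from_bytes(padded[i:i + 4], "big") for i in range(0, len(padded), 4)]


def trim_output(words, bit_length: int) -> bytes:
    """Drop the invalid bits that follow bit_length in the engine output."""
    raw = b"".join(w.to_bytes(4, "big") for w in words)
    n_bytes = (bit_length + 7) // 8
    out = bytearray(raw[:n_bytes])
    spare_bits = n_bytes * 8 - bit_length
    if spare_bits:
        out[-1] &= (0xFF << spare_bits) & 0xFF  # clear trailing invalid bits
    return bytes(out)


words = pad_to_words(b"\x01\x02\x03\x04\x05", 40)        # 40 bits need two words
trimmed = trim_output([0x01020304, 0xFFFFFFFF], 37)      # keep only 37 valid bits
```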
Authentication Modes: UIA2 / 128-EIA1
For an authentication operation, the following mode must be programmed in the SNOW Mode Register: 3’b100 or 3’b101.
• bit [0]: sets DIRECTION bit
• bit [2:1]: selects UIA2 / 128-EIA1
Besides the direction and mode of operation, the host must provide the following parameters:
• key, used to seed the LFSR
• iv, used to seed the LFSR
• length (in bits), used to finalize the authentication operation
• data, provided in 32-bit blocks via the data input bus (The number of 32-bit blocks provided to
the SNOW-3G must match the length, rounded up to a 32-bit multiple, divided by 32.)
The only result data is the 32-bit MAC-I. This MAC is only provided on the data output bus if all
input data is provided to and processed by the SNOW core.
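The mode values quoted in this section follow a simple pattern: bits [2:1] select the algorithm family and bit [0] carries the direction. The sketch below encodes that pattern and the block count rule for authentication data. The constant and function names are hypothetical; only the bit values come from the text above.

```python
ALG_CONFIDENTIALITY = 0b01  # UEA2 / 128-EEA1, as in 3'b011 and 3'b010
ALG_INTEGRITY = 0b10        # UIA2 / 128-EIA1, as in 3'b100 and 3'b101


def snow_mode(algorithm: int, direction: int = 0) -> int:
    """Build the 3-bit SNOW mode value from algorithm and direction bits."""
    return ((algorithm & 0b11) << 1) | (direction & 1)


def auth_block_count(length_bits: int) -> int:
    """Number of 32-bit blocks the host must supply for a given bit length."""
    return (length_bits + 31) // 32


mode = snow_mode(ALG_INTEGRITY, direction=1)
blocks = auth_block_count(189)
```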
Note: Since the SNOW-3G is 32-bit oriented and the amount of authentication data can be bit
aligned, the SNOW-3G pads the unused bits of the last 32-bit word with zeroes if needed. The host
always provides 32-bit blocks of data; when the length is not 32-bit aligned, the bits of
the last 32-bit word beyond the length are forced to zero by the SNOW-3G.
Refer to “Glossary, Conventions and Standards” on page 585 for information on conventions and
standards.
C.2 KASUMI Engines
C.2.1 Introduction
C.2.1.1 Specification Summary
The KASUMI engine supports the following features:
• Key scheduling hardware
• f8 and f9 algorithm support
• Automatic data padding mechanism for f9 algorithm
• KASUMI encryption and decryption modes
[Block diagram: KASUMI Engine with f8 / f9 Wrapper, KASUMI Key Scheduling and KASUMI Calculation blocks. Signals: mode [3:0], config [31:0], fresh [31:0], key [127:0], data [63:0] in, data [63:0] out.]
Figure C-2: KASUMI Engine Diagram
C.2.2 KASUMI Engine Functional Description
C.2.2.1 General Processing
The KASUMI engine is an efficient implementation of the KASUMI cipher algorithm, the f8 confidentiality algorithm and the f9 integrity algorithm.
The KASUMI engine contains three operational modes:
• KASUMI mode (encrypt and decrypt)
• f8 mode
• f9 mode
In general, the plain message data is parsed and sequentially fed into the engine in 64-bit blocks.
The processing of one 64-bit data block takes 8 rounds. A round takes one clock cycle.
It is not possible to interleave two packets using f8 and/or f9 mode. This means a full f8 or f9
operation must be finished before the next one can be started.
KASUMI Mode
During KASUMI mode, the 64-bit plaintext data blocks are encrypted/decrypted under a 128-bit
key, which is programmed into the KASUMI engine by the Host. During each round, the Round
Keys are derived from this initial key. After eight processing rounds, the KASUMI encrypted
cipher text data block will be stored in the output data register.
The KASUMI engine is able to operate in the following sub modes:
• Encrypt mode
• Decrypt mode
When operating in encrypt mode, plaintext data is transformed into KASUMI cipher text data. In
decrypt mode, the KASUMI cipher text data is transformed back into the original plaintext data.
More information regarding the KASUMI algorithm can be found in [3GPP TS 35.202].
f8_mode
During f8 mode, the 64-bit message data blocks are transformed into 64-bit output data blocks
under the control of a 128-bit Confidentiality Key. The total message length can be up to 2^16 =
65536 bits.
Note: KASUMI f8 is a stream cipher and for this reason does not require padding. However, the
KASUMI engine is block oriented, which means that data must be submitted as 64-bit
multiples. Therefore, if a message is not a multiple of 64 bits, the last input
block needs to be filled up with some value, for example zeroes, to complete the 64-bit
block. The remaining (invalid) bits of the last 64-bit result data block must be
removed by the host or external system after the encryption/decryption operation.
More information regarding the f8 algorithm can be found in [3GPP TS 35.201].
f9_mode
During f9 mode, the message data is processed in 64-bit chunks under the control of a 128-bit
Integrity Key. The result, after all message data has been processed, is a 32-bit MAC value. The
total length of the message data can be up to 2^16 = 65536 bits.
C.2.2.2 Examples
Listing C-3. Example 1: KASUMI Encrypt Session
[3GPP TS 35.203], chapter 3.3 (Test Set 1):
Key:         2BD6459F 82C5B300 952C4910 4881FF48
Data input:  EA024714 AD5C4D84
Data output: DF1F9B25 1C0BF45F
KASUMI engine Interface:
key_in[127:0]:  2BD6459F 82C5B300 952C4910 4881FF48
data_in[63:0]:  EA024714 AD5C4D84
data_out[63:0]: DF1F9B25 1C0BF45F
Listing C-4. Example 2: f8 Session
[3GPP TS 35.203], chapter 4.3 (Test Set 1):
Key:         2BD6459F 82C5B300 952C4910 4881FF48
Count:       72A4F20F
Bearer:      0C
Direction:   1
Length:      798 bits
Data input:  7EC61272 743BF161 .. 9B134880
Data output: D1E2DE70 EEF86C69 .. 9339650F
KASUMI engine Interface:
key_in[127:0]:   2BD6459F 82C5B300 952C4910 4881FF48
count_in[31:0]:  72A4F20F
config_in[31:0]: 00000019
data_in[63:0]:   7EC61272 743BF161
                 :
                 9B134880 [00000000] <- pad to the right to a full block
data_out[63:0]:  D1E2DE70 EEF86C69
                 :
                 9339650F [25915EE3] <- ignore encrypted padding
Listing C-5. Example 3: f9 Session
[3GPP TS 35.203], chapter 5.3 (Test Set 1):
Key:         2BD6459F 82C5B300 952C4910 4881FF48
Count:       38A6F056
Fresh:       05D2EC49
Direction:   0
Length:      189 bits
Data input:  6B227737 296F393C 8079353E DC87E2E8 05D2EC49 A4F2D8E0
MAC output:  F63BD72C
KASUMI engine Interface:
key_in[127:0]:   2BD6459F 82C5B300 952C4910 4881FF48
count_in[31:0]:  38A6F056
fresh_in[31:0]:  05D2EC49
config_in[31:0]: 00BD0000
data_in[63:0]:   6B227737 296F393C
                 8079353E DC87E2E8
                 05D2EC49 A4F2D8E0
data_out[63:32]: F63BD72C
C.3 Packet Processor — Programming
C.3.1 Introduction
C.3.1.1 Purpose
This section describes how to program the various protocols and modes within Tilera’s packet
processor. This process involves:
• Identifying key protocol concerns and programming procedures
• Describing protocol processing flows and data flows within the packet processor, and providing
programming examples
C.3.1.2 Scope
This section specifically covers the use of the packet processor protocols, modes, and
onboard token and context interfaces for the supported configurations.
C.3.1.3 Abbreviation and Definitions
For a complete list of abbreviations and definitions, refer to “Glossary, Conventions and Standards” on page 585.
Note: Since TLS and DTLS protocols originate from the same SSL protocol, the abbreviation ICV
(Integrity Check Value) is interchangeable with the abbreviation MAC (Message
Authentication Code).
C.3.1.4 Data Flow Table
In this appendix we often refer to the data flow table, which represents data movement during
packet processing. The generic view of the table is shown in Table C-85.
Table C-85. Data Flow Table
N | Instruction | Source of Data | Destination: Remove | Destination: Hash | Destination: Cipher | Destination: Output | Destination: Context
<instruction number> | <instruction> | <source> | <data to remove> | <data to hash> | <data to encrypt / decrypt> | <data for the output> | <data for context record>
Each line of the table represents a separate token instruction for data movement and processing,
except the execution of the REMOVE_RESULT instruction, which indicates the action of the previously
scheduled instruction.
The Instruction field represents the name of the processing instruction.
The Source of data field represents the source of the data block. The Destination field
represents the destination of the data during processing. It is possible for the same data to go to several
different destinations at the same time.
The data flow tables use the following colors to highlight specific data types, as shown in Table C-86.
Table C-86. Data Flow Tables
Color | Fields
Cyan | Bypassed data
Blue | Packet header data
Yellow | IV and other fields
Grey | Packet payload, padding etc.
Light-green | ICV, MAC or other message integrity or authentication field
Ochre | Additional fields used internally for processing
C.3.2 ARC4 Algorithm
The packet processor supports the ARC4 algorithm according to the ARC4 specification as a basic
operation. The ARC4 state table is located in the Extra Data immediately following the Context
Record. Pointers to the state record and the I-J Pointer are located in the context.
The ARC4 mode is controlled via the following bits:
• State selection. Selects stateless (0) or stateful (1) mode.
• I-J Pointer. When set, indicates that the I-J Pointer and ARC4 state pointer are present in the context
record.
• Crypto store. Must be set to 1 for later reuse of the I-J Pointer.
• Initialize ARC4. This bit is in the packet-based options field and is applicable only when context control words are loaded from the token. When set, it enables initialization of the ARC4 state
memory with the default state and overrides the stateful mode of the state selection field.
The fetching and update of ARC4 state is shown in Figure C-3.
[Diagram: Context Memory holds the Context (including the I-J Pointer and ARC4 State Pointer) and the ARC4 State (256 bytes). The context fetch (including I-J pointer and ARC4 state pointer) and the ARC4 state fetch pass through the Context DMA and MUX to the Context, the ARC4 Memory and the ARC4 Core. CONTEXT_ACCESS instructions perform the I-J pointer update and the ARC4 state update.]
Figure C-3: Fetching and Update of Context Record and ARC4 State
When a packet with ARC4 in stateful mode is started, the context record should contain the I-J
Pointer and ARC4 state pointer fields. The ARC4 state is fetched automatically into the dedicated ARC4 RAM immediately after the context fetch. Note that context
and ARC4 state are both in the Extra Data.
The ARC4 state and the I-J Pointer in the external context memory are updated using two CONTEXT_ACCESS instructions, one for each.
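The reason stateful mode needs both the 256-byte state and the I-J Pointer can be seen in a software model of ARC4: resuming a stream requires the permutation table plus the two indices. The sketch below is a plain software ARC4 with hypothetical save and restore helpers. It does not model the context record layout or the CONTEXT_ACCESS instructions.

```python
class Arc4:
    """Software ARC4 whose state (S table, i, j) can be saved and restored."""

    def __init__(self, key: bytes):
        self.s = list(range(256))
        self.i = self.j = 0
        j = 0
        for i in range(256):  # key scheduling
            j = (j + self.s[i] + key[i % len(key)]) & 0xFF
            self.s[i], self.s[j] = self.s[j], self.s[i]

    def process(self, data: bytes) -> bytes:
        """Encrypt or decrypt data, advancing the internal state."""
        s, out = self.s, bytearray()
        for byte in data:
            self.i = (self.i + 1) & 0xFF
            self.j = (self.j + s[self.i]) & 0xFF
            s[self.i], s[self.j] = s[self.j], s[self.i]
            out.append(byte ^ s[(s[self.i] + s[self.j]) & 0xFF])
        return bytes(out)

    def save(self):
        """Return the 256-byte state and the i, j indices."""
        return bytes(self.s), self.i, self.j

    @classmethod
    def restore(cls, state: bytes, i: int, j: int) -> "Arc4":
        """Rebuild an instance from a saved state without re-keying."""
        obj = cls.__new__(cls)
        obj.s, obj.i, obj.j = list(state), i, j
        return obj


one_shot = Arc4(b"Key").process(b"Plaintext")

first = Arc4(b"Key")
part1 = first.process(b"Plain")
resumed = Arc4.restore(*first.save())   # second packet resumes from saved state
part2 = resumed.process(b"text")
```

Processing the message in two packets with a saved state gives the same result as a single pass, which is the behavior stateful mode relies on.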
C.3.3 AES-CCM for Basic Operations and IPSec Protocols
C.3.3.1 Introduction
The packet processor supports AES-CCM as a basic operation, as specified in RFC3610 and as part
of the IPSec ESP protocol. Refer to “Packet Processor References” on page 577 for more
information.
The AES-CCM combines two cryptographic mechanisms that are based on AES encryption. The
processing diagram is shown in Figure C-4.
[Diagram: AES-CCM processing. AES-CBC encrypt path: the 16-byte blocks B0 (IV), B1 ... Bk (aad, zero padded) and Bk+1 ... Bm (message, zero padded) are chained through AES to produce the result of CBC-MAC, a 16-byte Tag that is truncated to 8/12/16 bytes. AES-CTR encrypt path: the 16-byte counter blocks A1, A2 ... An and A0 are encrypted with AES to give S1, S2 ... Sn and S0; the payload is combined with S1 ... Sn to give the result ciphertext C1, C2 ... Cn, and the Tag is combined with S0 (truncated) to give the truncated MAC. The header is input for authentication; the payload is input for authentication and encryption.]
Figure C-4: AES-CCM Processing Diagram
One mechanism within CCM is the Counter (CTR) mode for confidentiality, which is specified in
[RFC3686]. The CTR mode requires the generation of a sequence of blocks, called counter blocks
that are unique for a given key.
The other cryptographic mechanism within CCM is an adaptation of the Cipher Block Chaining
(CBC) technique from [NIST SP800-38a] to provide assurance of authenticity. In the CBC technique an initialization vector is applied to the data to be authenticated. The final block of the
resulting CBC output serves as a Message Authentication Code (MAC) of the data. The algorithm
for generating a MAC is commonly called CBC-MAC. The same cryptographic key is used for
both the CTR and CBC-MAC mechanisms within CCM.
The CCM specification [RFC 3610] defines two parameters:
1. M: The size of the authentication field, which can be any of the following: 4, 6, 8, 10, 12, 14
and 16 bytes.
Although for IPSec ESP only 8 and 16 bytes are mandatory, the packet processor also supports
an authentication field size of 12 bytes for IPSec ESP, since this is the default in the ESP specification [RFC 4303]. For basic operations, all ICV sizes are supported.
2. L: The size of the length field, which can have any value between 2 bytes and 8 bytes.
For AES-CCM with IPSec ESP, this length must be 4 bytes. For basic operations, lengths
of 2 bytes and 4 bytes are supported. A length of more than 4 bytes can be supported, but generally this is not required.
C.3.3.2 Authentication
For authentication, the data are divided into a sequence of 128-bit blocks B0, B1... Bm (see also
Figure C-4). Then the AES-CBC-MAC function is applied to these blocks. The first block, B0 (the IV
for the AES-CBC function), is formatted as shown in Figure C-5.
[Diagram: layout of block B0 for Length (m) = 4 bytes and for Length (m) = 2 bytes. In both cases B0 consists of a Flags byte, Nonce N (Salt), Nonce N (IV) and Length (m). Flags byte: bit 7 = 0, bit 6 = Adata, bits 5:3 = M length (encoded), bits 2:0 = L length (encoded = 011 for 4 bytes, 001 for 2 bytes).]
Figure C-5: Block B0 for Authentication
The values for the flags register are shown in Table C-5.
Table C-5. Flags Register for B0 Vector
Bit | Field Name | Description
[7] | Reserved | This bit must be set to 0.
[6] | Adata | Additional Authentication Data. 0: Length of AAD is zero. 1: Length of AAD is not zero.
[5:3] | M Length | MAC Length. This field indicates the encoded size (length) of the authentication field (MAC). 000: Illegal. 001: Length M is 4 bytes. 010: Length M is 6 bytes. 011: Length M is 8 bytes. 100: Length M is 10 bytes. 101: Length M is 12 bytes. 110: Length M is 14 bytes. 111: Length M is 16 bytes.
[2:0] | L Length | Encoded size of the length field. 000: Illegal. 001: Length L is 2 bytes. 010: Length L is 3 bytes. 011: Length L is 4 bytes. 100: Length L is 5 bytes. 101: Length L is 6 bytes. 110: Length L is 7 bytes. 111: Length L is 8 bytes.
If AAD data are present, as indicated by the Adata bit, then AAD length field plus additional
authentication data are added. The last block is padded with zeros to a full 16-byte block.
The AES-CBC-MAC is computed over all the data blocks, B0…Bm, and the final result is a tag T.
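Because the host must precalculate B0, it may help to see the block sequence assembled in software. The sketch below builds B0 from the flag encodings in Table C-5 and appends the AAD length field, AAD and zero padding. The function names are hypothetical, the 2-byte AAD length encoding follows RFC 3610 and only covers AAD shorter than 0xFF00 bytes, and the AES-CBC-MAC step itself is not shown.

```python
def ccm_b0(nonce: bytes, msg_len: int, mac_len: int,
           aad_present: bool, l_size: int) -> bytes:
    """Format block B0: flags byte, nonce (salt + IV), message length."""
    assert len(nonce) == 15 - l_size, "nonce and length field must fill 15 bytes"
    flags = ((0x40 if aad_present else 0)      # bit 6: Adata
             | (((mac_len - 2) // 2) << 3)     # bits 5:3: encoded M
             | (l_size - 1))                   # bits 2:0: encoded L
    return bytes([flags]) + nonce + msg_len.to_bytes(l_size, "big")


def zero_pad16(data: bytes) -> bytes:
    """Pad with zeros to a multiple of 16 bytes."""
    return data + b"\x00" * (-len(data) % 16)


def ccm_hash_stream(nonce: bytes, aad: bytes, payload: bytes,
                    mac_len: int, l_size: int) -> bytes:
    """Return B0..Bm, the input to AES-CBC-MAC."""
    stream = ccm_b0(nonce, len(payload), mac_len, bool(aad), l_size)
    if aad:
        # 2-byte AAD length field (short AAD only), then AAD, then padding.
        stream += zero_pad16(len(aad).to_bytes(2, "big") + aad)
    return stream + zero_pad16(payload)


# Inputs of RFC 3610 packet vector #1: 13-byte nonce, 8-byte AAD, 23-byte payload.
nonce = bytes.fromhex("00000003020100a0a1a2a3a4a5")
blocks = ccm_hash_stream(nonce, bytes(range(8)), bytes(range(8, 31)),
                         mac_len=8, l_size=2)
```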
C.3.3.3 Encryption
Encryption uses the AES-CTR function to transform the plain text into cipher text and vice-versa.
The cipher input is a sequence of 128-bit counter blocks A1, A2 … An, and then A0 (see also
Figure C-4). The format of A0 is shown in Figure C-6. The Flags field provides information about
the length of the message data.
Shown in Figure C-6 are the layouts of the A0 vector for two cases (counter length is 4 bytes and
counter length is 2 bytes).
[Diagram: layout of block A0 for Counter Length = 4 bytes and for Counter Length = 2 bytes. In both cases A0 consists of a Flags byte, Nonce N (Salt), Nonce N (IV) and Counter (m). Flags byte: bits 7:3 = 0, bits 2:0 = Counter length (encoded = 011 for 4 bytes, 001 for 2 bytes).]
Figure C-6: Block A0 for Encryption
The values for flags register are shown in Table C-6.
Table C-6. Flags Register for A0 Vector (a)
Bit | Field Name | Description
[7:3] | Reserved | Must be set to 0.
[2:0] | L Length | Encoded size of the block counter. 000: Illegal. 001: Length L is 2 bytes. 010: Length L is 3 bytes. 011: Length L is 4 bytes. 100: Length L is 5 bytes. 101: Length L is 6 bytes. 110: Length L is 7 bytes. 111: Length L is 8 bytes.
a. The initial counter value of the A0 vector must be initialized to zero, in contrast to AES-GCM/GMAC, where the
initial counter is initialized to 1.
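A matching sketch for the A0 vector is shown below. It uses the flag encoding of Table C-6 and a counter field that starts at zero. The function name is hypothetical, and the nonce is assumed to be the concatenation of salt and IV, as the A0 layout figure indicates.

```python
def ccm_a0(nonce: bytes, l_size: int, counter: int = 0) -> bytes:
    """Format a counter block: flags byte, nonce (salt + IV), counter.

    With counter=0 this is A0; higher counter values give A1, A2, ...
    """
    assert len(nonce) == 15 - l_size, "nonce and counter must fill 15 bytes"
    flags = l_size - 1  # bits 7:3 reserved as zero, bits 2:0 encoded length
    return bytes([flags]) + nonce + counter.to_bytes(l_size, "big")


nonce = bytes.fromhex("00000003020100a0a1a2a3a4a5")  # nonce of RFC 3610 vector #1
a0 = ccm_a0(nonce, l_size=2)
a1 = ccm_a0(nonce, l_size=2, counter=1)
```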
C.3.3.4 Implementation
The AES-CCM basic operation within the packet processor requires that the A0 and B0 fields be precalculated by the host and provided to the packet processor. B0, used for hashing, is provided via
the token; A0, the initial IV for counter mode, is provided via the IV field in the context record.
The encrypted A0, which is XOR-ed with the TAG, is created by encrypting a block of zeros with
IV=A0. Later the encrypted result Y0 is removed from the output stream.
The authentication keys for the XCBC engine are the following:
• Key1 and Key2 are zero.
• Key0 is the same as the cipher key, but with each word swapped.
Since A0 and B0 are pre-calculated by the host, any allowable values for M, L and Counter Length
(maximum 4) are supported. In the case of an ESP packet, the salt/nonce values come from the
Security Association.
The implementation summary is presented in Table C-7.
Table C-7. AES-CCM Supported Functionality
Functionality | Inbound | Outbound
Key Length | 128-, 192-, 256-bit | 128-, 192-, 256-bit
M, L | Any | Any
Counter Length | 1-4 | 1-4
A0 Vector: Flag | From context | From token
A0 Vector: Salt (a) | From context | From context
A0 Vector: IV (a) | From token (For ESP, IV is taken from input) | From token (For ESP, IV is taken from context (b))
A0 Vector: Counter | From context IV (Zero value) | From context IV (Zero value)
B0 Vector: Flag | From token | From token
B0 Vector: Salt (a) | From token | From token
B0 Vector: IV (a) | From token (For ESP, IV is taken from input) | From token (For ESP, IV is taken from context (b))
B0 Vector: Message length | From token | From token
B0 Vector: AAD Length Field | From token | From token
a. The Salt and IV must be the same for A0 and B0 vectors.
b. IV in the context can be also generated by the PRNG.
For information about byte order, refer to the “Examples” on page 259.
C.3.3.5 Basic Operation
Introduction
This section explains how to perform basic inbound and outbound transforms using AES-CCM
basic operation with different parameters.
Context Control Words
The AES-CCM processing requires that the correct mode of the engine be set in the context control words.
The layout and settings of the control words are shown in the tables that follow.
Context – Control Word 0
[Bit-field diagram of Context Control Word 0, bits 31 to 00. Fields shown: ToP, packet-based options, key, crypto algorithm, digest type, hash algorithm, SEQ, MASK0, MASK1, SPI, context length and reserved bits.]
The applicable fields are listed in Table C-8.
Table C-8. Basic AES-CCM Context Control Word 0
Field | Value | Description
hash algorithm (a) | 001 | XCBC with 128-bit Key 1.
hash algorithm (a) | 010 | XCBC with 192-bit Key 1.
hash algorithm (a) | 011 | XCBC with 256-bit Key 1.
digest type | 10 | Use dedicated hash algorithm (XCBC).
crypto algorithm | 101 | AES-128.
crypto algorithm | 110 | AES-192.
crypto algorithm | 111 | AES-256.
key | 1 | The Key is used in processing.
context length | * | See description in “context length” on page 437.
packet based options | 0000 | Default value.
ToP (Type of Packet) | 1110 | For outbound (hash-then-encrypt operation).
ToP (Type of Packet) | 0111 | For inbound (decrypt-then-hash operation).
a. The length of Key 1 for XCBC must be equal to the AES key length. Choosing different key lengths for the XCBC and AES algorithms produces incorrect results.
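Footnote a implies that the hash algorithm and crypto algorithm codes should always be chosen together from the key length. A small helper can enforce this, as sketched below. It returns field values only; the bit positions within Control Word 0 are not reproduced here, so no packed word is built. All names are hypothetical.

```python
# Field codes copied from Table C-8, keyed by AES key length in bits.
CCM_CODES = {
    128: {"hash_algorithm": 0b001, "crypto_algorithm": 0b101},
    192: {"hash_algorithm": 0b010, "crypto_algorithm": 0b110},
    256: {"hash_algorithm": 0b011, "crypto_algorithm": 0b111},
}
TOP_OUTBOUND = 0b1110  # hash-then-encrypt
TOP_INBOUND = 0b0111   # decrypt-then-hash


def ccm_control_word0_fields(key_bits: int, outbound: bool) -> dict:
    """Return matching Control Word 0 field values for basic AES-CCM."""
    if key_bits not in CCM_CODES:
        raise ValueError("key length must be 128, 192 or 256 bits")
    fields = dict(CCM_CODES[key_bits])  # XCBC and AES lengths stay in step
    fields.update({
        "digest_type": 0b10,
        "key": 1,
        "packet_based_options": 0b0000,
        "ToP": TOP_OUTBOUND if outbound else TOP_INBOUND,
    })
    return fields


outbound_256 = ccm_control_word0_fields(256, outbound=True)
```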
Context – Control Word 1
[Bit-field diagram of Context Control Word 1, bits 31 to 00. Fields shown: state selection, i-j-pntr, hash store, enc. hash result, pad type, crypto mode, seq. nbr. store, Feedback, disable mask upd., IV0, IV1, IV2, IV3, digest cnt, IV format, crypto-store, address mode and reserved bits.]
The applicable fields are listed in Table C-9.
Table C-9. Basic AES-CCM Context Control Word 1
Field | Value | Description
hash store | 0 | Do not save result digest into internal context register.
enc hash result | 1 | Use encrypted hash result (XOR operation).
pad type | 000 | No padding.
pad type | 100 | IPSec padding.
crypto store | 0 | Do not store result IV.
IV format | 00 | Full IV mode.
IV3...IV0 | 1001 | IV0 + IV3, for ESP inbound.
IV3...IV0 | 1111 | 16-byte IV (AES), for all other cases.
crypto mode | 110 | AES-CTR with load/reuse of the counter.
Outbound Data Flow
The basic outbound operation for AES-CCM is executed according to Table C-10.
Table C-10. Basic Outbound AES-CCM
(Remove, Hash, Cipher, Output, and Context are the Destination columns.)

| Instruction | Source of Data | Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|
| DIRECTION (Bypass data) | input | - | - | - | Bypass data | - |
| INSERT (a) (B0+AAD length field) | token | - | B0+AAD length field | - | - | - |
| DIRECTION (a) (AAD data) | input | - | AAD data | - | AAD data | - |
| INSERT (a) (zero padding for AAD data) | instruction | - | Zeroes | - | - | - |
| REMOVE_RESULT (schedule removal of S0 from the output) | output buffer | - | - | - | - | - |
| INSERT (zeroes for S0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | - |
| DIRECTION (Payload) | input | - | Payload | Payload | Payload | - |
| INSERT (zero padding for cipher data) | instruction | - | Zeroes | - | - | - |
| Execution of REMOVE_RESULT instruction | output buffer | S0 | - | - | - | - |
| INSERT (result TAG field) | context (hash result) | - | - | - | TAG | - |

a. When there is no AAD data, the AAD length and the supplementary zero padding for AAD data are not hashed.
The outbound processing functions as follows:
1. The bypass data are passed directly to the output.
2. The B0, AAD length, AAD data, and zero padding are inserted into the hash stream.
3. An instruction is scheduled to remove S0 from the output buffer.
4. A block of zeros is inserted into the cipher stream to create the encrypted A0 block (S0).
5. The payload data are hashed, encrypted, and then passed to the output.
6. Additional zero padding is inserted into the hash stream.
7. The S0 block is removed from the output buffer.
8. The result TAG is appended to the output stream.
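The hash-stream layout described above (B0, AAD length, AAD data, zero padding, payload, zero padding) can be modelled on the host as a short Python sketch. This is an illustration only, not packet processor code: the function names are hypothetical, a 16-byte hash block (AES) is assumed, and the 2-byte AAD length encoding is the RFC 3610 form for AAD shorter than 0xFF00 bytes.

```python
BLOCK = 16  # assumed hash block size in bytes (AES block)


def zero_pad(data: bytes) -> bytes:
    """Append zero bytes up to the next 16-byte boundary."""
    return data + bytes(-len(data) % BLOCK)


def ccm_hash_stream(b0: bytes, aad: bytes, payload: bytes) -> bytes:
    """Model the byte stream that reaches the hash engine."""
    if len(b0) != BLOCK:
        raise ValueError("B0 must be exactly one block")
    stream = b0
    if aad:
        # With no AAD, the length field and its padding are not hashed.
        if len(aad) >= 0xFF00:
            raise ValueError("only the 2-byte AAD length form is modelled")
        stream += zero_pad(len(aad).to_bytes(2, "big") + aad)
    stream += zero_pad(payload)
    return stream


# Values taken from RFC 3610 packet vector #1 (8 header bytes, 23 payload bytes).
b0 = bytes.fromhex("59000000 03020100 A0A1A2A3 A4A50017")
aad = bytes(range(8))
payload = bytes(range(8, 31))
stream = ccm_hash_stream(b0, aad, payload)
print(len(stream), stream[16:32].hex())
```

Each 16-byte slice of `stream` corresponds to one block fed into the CBC-MAC chain, so the padding inserted by the two INSERT instructions shows up as the trailing zero bytes of the AAD block and of the last payload block.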
Inbound Data Flow
The basic inbound operation for AES-CCM is executed according to Table C-11.
Table C-11. Basic Inbound AES-CCM
(Remove, Hash, Cipher, Output, and Context are the Destination columns.)

| Instruction | Source of Data | Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|
| DIRECTION (bypass data) | input | - | - | - | Bypass | - |
| INSERT (B0+AAD length field) | token | - | B0+AAD length | - | - | - |
| DIRECTION (AAD data) | input | - | AAD data | - | AAD data | - |
| INSERT (zero padding for AAD data) | instruction | - | Zeroes | - | - | - |
| REMOVE_RESULT (schedule removal of S0 from the output) | output buffer | - | - | - | - | - |
| INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | - |
| DIRECTION (encrypted Payload) | input | - | Payload | Payload | Payload | - |
| INSERT (zero padding for cipher data) | instruction | - | Zeroes | - | - | - |
| Execution of REMOVE_RESULT instruction | output buffer | S0 | - | - | - | - |
| RETRIEVE (TAG field) | input | - | - | - | - | TAG |
| VERIFY_FIELDS (calculated TAG with the TAG from the input) | context | - | - | - | - | - |
The inbound processing functions as follows:
1. The bypass data are passed directly to the output.
2. The B0, AAD length, AAD data, and zero padding are inserted into the hash stream.
3. An instruction is scheduled to remove S0 from the output buffer.
4. A block of zeros is inserted into the cipher stream to create the encrypted A0 block (S0).
5. The payload data are hashed, decrypted, and then passed to the output.
6. Additional zero padding is inserted into the hash stream.
7. The S0 block is removed from the output buffer.
8. The TAG is retrieved from the input stream and stored in the context.
9. The calculated TAG is compared with the TAG retrieved from the input stream.
Packet Processor Examples
This section explains how the test vectors can be applied to the packet processor core. As an
example, test vector #1 from the RFC 3610 reference (see Chapter F: RFC 3610) is shown below.
=============== Packet Vector #1 ==================
AES Key =  C0 C1 C2 C3  C4 C5 C6 C7  C8 C9 CA CB  CC CD CE CF
Nonce =    00 00 00 03  02 01 00 A0  A1 A2 A3 A4  A5
Total packet length = 31. [Input with 8 cleartext header octets]
           00 01 02 03  04 05 06 07  08 09 0A 0B  0C 0D 0E 0F
           10 11 12 13  14 15 16 17  18 19 1A 1B  1C 1D 1E
CBC IV in: 59 00 00 00  03 02 01 00  A0 A1 A2 A3  A4 A5 00 17
CBC IV out:EB 9D 55 47  73 09 55 AB  23 1E 0A 2D  FE 4B 90 D6
After xor: EB 95 55 46  71 0A 51 AE  25 19 0A 2D  FE 4B 90 D6   [hdr]
After AES: CD B6 41 1E  3C DC 9B 4F  5D 92 58 B6  9E E7 F0 91
After xor: C5 BF 4B 15  30 D1 95 40  4D 83 4A A5  8A F2 E6 86   [msg]
After AES: 9C 38 40 5E  A0 3C 1B C9  04 B5 8B 40  C7 6C A2 EB
After xor: 84 21 5A 45  BC 21 05 C9  04 B5 8B 40  C7 6C A2 EB   [msg]
After AES: 2D C6 97 E4  11 CA 83 A8  60 C2 C4 06  CC AA 54 2F
CBC-MAC  : 2D C6 97 E4  11 CA 83 A8
CTR Start: 01 00 00 00  03 02 01 00  A0 A1 A2 A3  A4 A5 00 01
CTR[0001]: 50 85 9D 91  6D CB 6D DD  E0 77 C2 D1  D4 EC 9F 97
CTR[0002]: 75 46 71 7A  C6 DE 9A FF  64 0C 9C 06  DE 6D 0D 8F
CTR[MAC ]: 3A 2E 46 C8  EC 33 A5 48
Total packet length = 39. [Authenticated and Encrypted Output]
           00 01 02 03  04 05 06 07  58 8C 97 9A  61 C6 63 D2
           F0 66 D0 C2  C0 F9 89 80  6D 5F 6B 61  DA C3 84 17
           E8 D1 2C FD  F9 26 E0
According to the Introduction and Implementation sections, the host should pre-calculate A0 and
B0 vectors. The A0 vector should be written to the IV fields of the context:
IV0 [31:0]: 00000001 (Nonce + Flag)
IV1 [31:0]: 00010203
IV2 [31:0]: A3A2A1A0
IV3 [31:0]: 0000A5A4 (Initial counter = 0)
The B0 vector and the AAD data length are provided by an INSERT instruction that takes its data from the token.
Word 0 [31:0]: 00000059 (CBC IV0)
Word 1 [31:0]: 00010203 (CBC IV1)
Word 2 [31:0]: A3A2A1A0 (CBC IV2)
Word 3 [31:0]: 1700A5A4 (CBC IV3)
Word 4 [15:0]: 0800 (AAD length)
The cipher key is provided via the Key field of the context.
Key 0 [31:0]: C3C2C1C0
Key 1 [31:0]: C7C6C5C4
Key 2 [31:0]: CBCAC9C8
Key 3 [31:0]: CFCECDCC
The hash key is provided via the XCBC Key fields of the context.
K1_0 [31:0]: C0C1C2C3
K1_1 [31:0]: C4C5C6C7
K1_2 [31:0]: C8C9CACB
K1_3 [31:0]: CCCDCECF
K2_0, K2_1, K2_2, K2_3: 00000000
K3_0, K3_1, K3_2, K3_3: 00000000
The result TAG is stored in result digest field and has the following byte order.
Hash_Result_0 [31:0]: 2CD1E817
Hash_Result_1 [31:0]: E026F9FD
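The IV and token words listed above can be reproduced on the host with a short Python sketch. It is a sketch under stated assumptions: the function names are hypothetical, the B0 flag formula is the one from RFC 3610, and the little-endian packing of each group of four bytes into a 32-bit word is inferred from the values in this example rather than taken from a register specification.

```python
import struct


def ccm_b0(nonce: bytes, msg_len: int, tag_len: int, has_aad: bool) -> bytes:
    """Build the RFC 3610 B0 block: flags, nonce, message length."""
    length_size = 15 - len(nonce)                   # L in RFC 3610
    flags = (0x40 if has_aad else 0) | (((tag_len - 2) // 2) << 3) | (length_size - 1)
    return bytes([flags]) + nonce + msg_len.to_bytes(length_size, "big")


def ccm_a0(nonce: bytes) -> bytes:
    """Build the RFC 3610 A0 block: flags, nonce, counter = 0."""
    length_size = 15 - len(nonce)
    return bytes([length_size - 1]) + nonce + bytes(length_size)


def le_words(block: bytes) -> list:
    """Pack each 4-byte group as a little-endian 32-bit word (inferred packing)."""
    count = len(block) // 4
    return ["%08X" % w for w in struct.unpack("<%dI" % count, block)]


# Packet vector #1: 13-byte nonce, 23-byte message, 8-byte TAG, AAD present.
nonce = bytes.fromhex("00000003020100A0A1A2A3A4A5")
iv_words = le_words(ccm_a0(nonce))            # IV0..IV3 of the context
b0_words = le_words(ccm_b0(nonce, 23, 8, True))  # token words 0..3
print(iv_words)
print(b0_words)
```

For this vector the two lists match the IV0 to IV3 values and token words 0 to 3 shown in this section, which is a convenient self-check before programming the context.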
C.3.3.6 ESP
Introduction
This section explains how to perform ESP inbound and outbound transforms using AES-CCM.
AES-CCM is the first standard that mandates a variable-size ICV. For IPSec operations
that use AES-CCM, ICV sizes of 8 and 16 bytes must be supported, and 12 bytes may be supported. Because the ICV size is set by configuring the token instructions, all of these sizes are supported.
For both inbound and outbound cases, the following data is provided by the host:
• For the B0 vector, the Flag is calculated, concatenated with the Salt value, and provided in the token.
• The message length and AAD length are provided in the token.
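The host-side calculation of the Flag octet for each ICV size can be sketched as follows. The sketch assumes the RFC 4309 parameters for ESP (3-byte salt, 8-byte IV, 4-byte block counter) and the RFC 3610 flag formula; the function names are hypothetical, and the byte order of the resulting word inside the token is not modelled, so the helper returns raw bytes.

```python
def esp_ccm_b0_flag(icv_len: int, has_aad: bool = True, counter_len: int = 4) -> int:
    """Return the B0 flag octet for a given ICV size (RFC 3610 formula)."""
    if icv_len not in (8, 12, 16):
        raise ValueError("ICV size must be 8, 12, or 16 bytes")
    return (0x40 if has_aad else 0) | (((icv_len - 2) // 2) << 3) | (counter_len - 1)


def flag_and_salt(icv_len: int, salt: bytes) -> bytes:
    """Concatenate the flag octet with the 3-byte salt (assumed RFC 4309 salt size)."""
    if len(salt) != 3:
        raise ValueError("salt must be 3 bytes")
    return bytes([esp_ccm_b0_flag(icv_len)]) + salt


for size in (8, 12, 16):
    print(size, flag_and_salt(size, bytes.fromhex("AABBCC")).hex())
```

Because only the flag octet changes with the ICV size, supporting a different ICV length on the host amounts to changing this one value and the length used by the ICV insert or retrieve instruction.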
Outbound Flow
The AES-CCM for ESP outbound is executed according to Table C-12.
Table C-12. ESP outbound with AES-CCM
(Remove, Hash, Cipher, Output, and Context are the Destination columns.)

| Instruction | Source of Data | Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|
| INSERT (Flag and Salt) | token | - | Flag and Salt | - | - | - |
| INSERT (IV) | context (IV1 offset) | - | IV | - | - | - |
| INSERT (Msg. length + AAD length) | token | - | Msg. length + AAD length | - | - | - |
| INSERT (a) (SPI) | context | - | ESP SPI | - | ESP SPI | - |
| INSERT (b) (Seq. num. high for ESN) | context | - | Seq. num. high | - | - | - |
| INSERT (a) (Seq. num. low) | context | - | Seq. num. low | - | Seq. num. low | - |
| INSERT (zero padding for AAD data) | instruction | - | Zeroes | - | - | - |
| INSERT (IV) | context (IV1 offset) | - | - | - | IV | - |
| REMOVE_RESULT (schedule removal of S0 from the output) | output buffer | - | - | - | - | - |
| INSERT (zeroes for S0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | - |
| DIRECTION (payload) | input | - | Payload | Payload | Payload | - |
| INSERT (IPSec crypto padding) | instruction | - | IPSec padding | IPSec padding | IPSec padding | - |
| INSERT (zero padding for cipher data) | instruction | - | Zeroes | - | - | - |
| Execution of REMOVE_RESULT instruction | output buffer | S0 | - | - | - | - |
| INSERT (ICV field) | context (hash result) | - | - | - | ICV | - |
| CONTEXT_ACCESS (update sequence number) | context | - | - | - | - | sequence number |

a. For regular sequence numbering these instructions can be combined into one instruction.
b. Applicable in case of using the extended sequence numbering.
Outbound processing functions as follows:
1. The word containing the Flag and Salt value is taken from the token and provided to the hash engine.
2. IV data is taken from the context and provided to the hash engine.
3. Finally, the message length and AAD length fields are taken from the token and provided to the hash engine to complete the (B0 + AAD length) vector.
4. The ESP header is taken from the context and hashed. With ESN, the upper sequence number is hashed only (it is not passed to the output), before the regular sequence number is hashed.
5. Zero padding is provided to the hash engine to pad the hash data to the hash block size.
6. The IV is inserted in the output stream.
7. An instruction is scheduled to remove S0 from the output buffer.
8. A block of zeros is inserted to create the S0 vector.
9. The payload and optional padding are encrypted and then passed to the output stream.
10. Zero padding is inserted in the hash engine to pad the payload data to the hash block size.
11. The S0 block is removed from the output buffer.
12. The calculated ICV is inserted in the output stream.
13. The updated sequence number is written back to the context record in memory.
Inbound Flow
The AES-CCM for ESP inbound is implemented according to Table C-13.
Table C-13. ESP Inbound with AES-CCM
(Remove, Hash, Cipher, Output, and Context are the Destination columns.)

| Instruction | Source of Data | Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|
| RETRIEVE (store ESP header in context) | input | - | - | - | - | ESP header |
| INSERT (Flag and Salt for B0) | token | - | Flag and Salt | - | - | - |
| RETRIEVE (store IV in context and pass to hash) | input | - | IV | - | - | IV (IV1 offset) |
| INSERT (message length field + AAD length field) | token | - | msg. length + AAD length | - | - | - |
| INSERT (a) (ESP SPI) | context | - | ESP SPI | - | - | - |
| INSERT (b) (Seq. num. high for ESN) | context | - | Seq. num. high | - | - | - |
| INSERT (a) (Seq. num. low) | context | - | Seq. num. low | - | - | - |
| INSERT (zero padding for AAD data) | instruction | - | Zeroes | - | - | - |
| REMOVE_RESULT (schedule removal of S0 from the output) | output buffer | - | - | - | - | - |
| INSERT (zeroes for S0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | - |
| DIRECTION (encrypted Payload) | input | - | Payload | Payload | Payload | - |
| INSERT (zero padding for cipher data) | instruction | - | Zeroes | - | - | - |
| Execution of REMOVE_RESULT instruction | output buffer | S0 | - | - | - | - |
| RETRIEVE (ICV field) | input | - | - | - | - | ICV |
| VERIFY_FIELDS (ICV, Padding, SPI, Seq.num) | context | - | - | - | - | - |
| CONTEXT_ACCESS (update sequence number and mask) | context | - | - | - | - | seq. number and mask |

a. For regular sequence numbering these instructions can be combined into one instruction.
b. Applicable in case of using the extended sequence numbering.
The inbound processing functions as described below:
1. The ESP header is retrieved from the input stream and stored in the context.
2. The word containing the Flag and Salt value is taken from the token and provided to the hash engine.
3. IV data is retrieved from the input stream and stored in the context at the same time as it is passed to the hash engine.
4. Finally, the message length and AAD length fields are taken from the token and provided to the hash engine to complete the (B0 + AAD length) vector.
5. The ESP header is provided to the hash engine from the context. With ESN, the upper sequence number is hashed before the regular sequence number.
6. Zero padding is inserted in the hash engine to pad the hash data to the hash block size.
7. An instruction is scheduled to remove S0 from the output buffer.
8. A block of zeros is inserted to create the S0 vector.
9. The payload is decrypted, de-padded, and then passed to the output stream.
10. Zero padding is inserted in the hash engine to pad the payload data to the hash block size.
11. The S0 block is removed from the output buffer.
12. The ICV is retrieved from the input stream and stored in the context.
13. The ICV, padding, SPI, and sequence number are verified.
14. The updated sequence number and mask are written back to the context record in memory.
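The sequence number check and mask update in the last two steps follow the usual IPSec anti-replay scheme. The sketch below is a host-side model of that scheme in the style of RFC 4303, using a 64-bit mask and 32-bit sequence numbers without ESN. The class and method names are hypothetical, and the exact mask bit ordering used by the hardware is not assumed to match.

```python
class ReplayWindow:
    """Sliding anti-replay window: bit i of mask marks (top - i) as seen."""

    def __init__(self, size: int = 64):
        self.size = size
        self.top = 0    # highest sequence number accepted so far
        self.mask = 0   # bitmap of received sequence numbers

    def check(self, seq: int) -> bool:
        """Return True if seq is new and inside or ahead of the window."""
        if seq == 0:
            return False            # sequence number 0 is never valid
        if seq > self.top:
            return True             # ahead of the window
        offset = self.top - seq
        if offset >= self.size:
            return False            # too old, left of the window
        return not (self.mask >> offset) & 1

    def update(self, seq: int) -> None:
        """Record seq after the ICV has been verified."""
        if seq > self.top:
            shift = seq - self.top
            self.mask = ((self.mask << shift) | 1) & ((1 << self.size) - 1)
            self.top = seq
        else:
            self.mask |= 1 << (self.top - seq)


window = ReplayWindow()
for seq in (1, 3):
    if window.check(seq):
        window.update(seq)
print(window.check(2), window.check(1))
```

The update is deliberately separate from the check: the window should move only after the ICV has been verified, which mirrors the order of the VERIFY_FIELDS and CONTEXT_ACCESS instructions.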
C.3.4 AES-GMAC/AES-GCM for Basic Operations and IPSec Protocols
C.3.4.1 Introduction
The packet processor supports AES-GCM as a basic operation and as part of ESP. It supports AES-GMAC as a basic operation and as part of the ESP and AH protocols. For more
information, refer to [RFC4106] and [RFC4543] in “Packet Processor References” on page 577.
The AES-GCM/GMAC operation in the packet processor is performed using the GHASH and AES-CTR sub-modules. During processing, Y0 for TAG encryption is created by encrypting a block of
zeros with the initial IV. Y0 is later removed from the output stream.
C.3.4.2 Basic Operation
Introduction
This section explains how to perform basic inbound and outbound transforms using AES-GCM/GMAC.
Outbound Data Flow
The AES-GCM/GMAC outbound processing is executed according to Table C-14.
Table C-14. Basic AES-GCM/GMAC Processing Flow
(Remove, Hash, Cipher, Output, and Context are the Destination columns.)

| Instruction | Source of Data | Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|
| DIRECTION (Bypass data) | input | - | - | - | Bypass data | - |
| DIRECTION (AAD data) | input | - | AAD data | - | AAD data | - |
| REMOVE_RESULT (schedule removal of Y0 from the output) | output buffer | - | - | - | - | - |
| INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | - |
| DIRECTION (a) (Payload) | input | - | Payload | Payload | Payload | - |
| Execution of REMOVE_RESULT instruction | output buffer | Y0 | - | - | - | - |
| INSERT (calculated TAG) | context (hash result) | - | - | - | TAG | - |

a. This is only for AES-GCM. In case of AES-GMAC, there is no payload data. The whole packet is considered to be the AAD data.
The outbound processing functions as follows:
1. The bypass data are passed directly to the output stream.
2. The AAD data are inserted in the hash stream.
3. An instruction is scheduled to remove Y0 from the output buffer.
4. A block of zeros is inserted in the cipher stream to create the encrypted Y0.
5. The payload data are hashed, encrypted, and then passed to the output stream.
6. The encrypted Y0 block is removed from the output buffer.
7. The result TAG is appended to the output stream.
Inbound Data Flow
The AES-GCM/GMAC inbound processing is executed according to Table C-15.
Table C-15. Basic AES-GCM/GMAC Processing Flow
(Remove, Hash, Cipher, Output, and Context are the Destination columns.)

| Instruction | Source of Data | Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|
| DIRECTION (pass bypass data to the output) | input | - | - | - | Bypass data | - |
| DIRECTION (AAD data) | input | - | AAD data | - | AAD data | - |
| REMOVE_RESULT (schedule removal of Y0 from the output) | output buffer | - | - | - | - | - |
| INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | - |
| DIRECTION (a) (Payload) | input | - | Payload | Payload | Payload | - |
| RETRIEVE (Store TAG in the context) | input | - | - | - | - | TAG result |
| Execution of REMOVE_RESULT instruction | output buffer | Y0 | - | - | - | - |
| VERIFY (calculated TAG) | context (hash result) | - | - | - | - | TAG |

a. This is only for AES-GCM. In case of AES-GMAC, there is no payload data. The whole packet consists only of AAD data.
The inbound processing functions as follows:
1. The bypass data are passed directly to the output.
2. The AAD data are inserted into the hash stream.
3. An instruction is scheduled to remove Y0 from the output buffer.
4. A block of zeros is inserted in the cipher stream to create the encrypted Y0.
5. The payload data are hashed, decrypted, and then passed to the output stream.
6. The TAG is retrieved from the input stream and stored in the context.
7. The encrypted Y0 block is removed from the output buffer.
8. The calculated TAG is compared with the retrieved TAG in the context.
C.3.4.3 IPSec
Introduction
This chapter describes how to perform ESP transforms using AES-GCM/GMAC, and AH transforms using AES-GMAC algorithm.
Context Control Words
For IPSec processing the context control words must be configured correctly. The layout and
allowable settings of the control words are shown in the figures that follow.
Context – Control Word 0
[Figure: Context Control Word 0 bit layout, bits 31 to 00. Fields shown: ToP, packet-based options, key, crypto algorithm, digest type, hash algorithm, SEQ, MASK0, MASK1, SPI, context length, and reserved bits.]
The applicable fields are listed in Table C-16.
Table C-16. IPSec with GMAC/GCM Context Control Word 0

| Field | Value | Description |
|---|---|---|
| MASK1, MASK0 | 00 | Outbound processing does not use mask. |
| | 01 | Inbound processing with 64-bit mask. |
| | 11 | Inbound processing with 128-bit mask. |
| SEQ | 01 | Use 32-bit packet number. |
| SPI | 1 | SPI value is used in processing. |
| hash algorithm | 100 | GHASH. |
| digest type | 10 | Use dedicated hash algorithm (GHASH). |
| crypto algorithm | 101 | AES-128. |
| | 110 | AES-192. |
| | 111 | AES-256. |
| Key | 1 | The Key is used in processing. |
| context length | * | See description in “context length” on page 437. |
| packet based options | 0000 | Default value. |
| ToP (Type of Packet) | 0110 | Outbound encrypt-then-hash operation for AES-GCM. |
| | 1110 | Outbound hash-then-encrypt operation for AES-GMAC. |
| | 1111 | Inbound hash-then-decrypt operation for both GCM and GMAC. |
Context – Control Word 1
[Figure: Context Control Word 1 bit layout, bits 31 to 00. Fields shown: state selection, i-j-pntr, hash store, enc. hash result, pad type, crypto mode, seq. nbr. store, Feedback, disable mask upd., IV0, IV1, IV2, IV3, digest cnt, IV format, crypto-store, address mode, and reserved bits.]
The applicable fields are listed in Table C-17.
Table C-17. IPSec with GMAC/GCM Context Control Word 1

| Field | Value | Description |
|---|---|---|
| hash store | 0 | Do not save result digest into internal context register. |
| enc hash result | 1 | Use encrypted hash result. |
| pad type | 000 | No padding. |
| | 100 | IPSec padding. |
| | 111 | IPSec zero padding. |
| crypto store | 0 | Do not store result IV. |
| IV format | 01 | Counter mode. |
| IV3..IV0 | 0111 | 12-byte IV (AES), when IV is from the context. |
| crypto mode | 010 | AES-CTR with counter initialized to 1. |
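The combination of a 12-byte IV and a counter initialized to 1 corresponds to the standard GCM counter block layout. The sketch below builds that 16-byte block on the host. It assumes the RFC 4106 split of the 12-byte IV into a 4-byte salt and an 8-byte per-packet IV; the function name is hypothetical, and how these bytes map onto the IV0 to IV2 context words is not modelled.

```python
def gcm_counter_block(salt: bytes, iv: bytes, counter: int) -> bytes:
    """Return salt || IV || 32-bit big-endian block counter."""
    if len(salt) != 4 or len(iv) != 8:
        raise ValueError("expected a 4-byte salt and an 8-byte IV")
    if not 0 < counter < 2 ** 32:
        raise ValueError("counter must fit in 32 bits and start at 1")
    return salt + iv + counter.to_bytes(4, "big")


salt = bytes.fromhex("CAFEBABE")
iv = bytes.fromhex("0001020304050607")

# Counter value 1: the block whose encryption yields Y0 for TAG encryption.
tag_block = gcm_counter_block(salt, iv, 1)
# Counter value 2: the first block used for payload encryption.
first_payload_block = gcm_counter_block(salt, iv, 2)
print(tag_block.hex())
print(first_payload_block.hex())
```

This ordering is why the flows in this section insert a block of zeroes into the cipher stream before the payload: the first keystream block is consumed for the TAG, and the payload starts with the next counter value.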
ESP Outbound Flow
The ESP outbound flow with AES-GMAC is shown in Table C-18.
Table C-18. ESP Outbound with AES-GMAC
(Remove, Hash, Cipher, Output, and Context are the Destination columns.)

| Instruction | Source of Data | Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|
| INSERT (a) (ESP SPI) | context | - | ESP SPI | - | ESP SPI | - |
| INSERT (b) (Seq. num. high) | context | - | Seq. num. high | - | - | - |
| INSERT (a) (Seq. num. low) | context | - | Seq. num. low | - | Seq. num. low | - |
| INSERT (IV) | context (IV1 offset) | - | IV | - | IV | - |
| REMOVE_RESULT (schedule removal of Y0 from the output) | output buffer | - | - | - | - | - |
| INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | - |
| DIRECTION (Payload) | input | - | Payload | - | Payload | - |
| INSERT (b) (IPSec padding) | instruction | - | IPSec padding | - | IPSec padding | - |
| Execution of REMOVE_RESULT instruction | output buffer | Y0 | - | - | - | - |
| INSERT (ICV field) | context | - | - | - | ICV | - |
| CONTEXT_ACCESS (write out updated sequence number) | context | - | - | - | - | Sequence number field |

a. Applicable in case of using the extended sequence numbering.
b. For regular sequence numbering these instructions can be combined into one instruction.
ESP Inbound Flow
The ESP inbound flow with AES-GMAC is shown in Table C-19 (regular sequence numbering).
Table C-19. ESP Inbound with AES-GMAC
(Remove, Hash, Cipher, Output, and Context are the Destination columns.)

| Instruction | Source of Data | Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|
| RETRIEVE (a) (hash and store ESP header) | input | - | ESP header | - | - | ESP header (SPI offset) |
| RETRIEVE (store IV in context) | input | - | IV | - | - | IV (IV1 offset) |
| REMOVE_RESULT (schedule removal of Y0 from the output) | output buffer | - | - | - | - | - |
| INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | - |
| DIRECTION (a) (encrypted Payload) | input | - | Payload | Payload | Payload | - |
| Execution of REMOVE_RESULT instruction | output buffer | Y0 | - | - | - | - |
| RETRIEVE (ICV field) | input | - | - | - | - | ICV |
| VERIFY_FIELDS (ICV, Padding, SPI, Seq.num) | context | - | - | - | - | - |
| CONTEXT_ACCESS (update sequence number and mask) | context | - | - | - | - | seq. number and mask |

a. In case of extended sequence numbering, header is processed according to the Table C-18.
When extended sequence numbering is used, the ESP header is retrieved and hashed by
using the instructions listed in Table C-20.
Table C-20. ESP Header Processing in case of ESN with AES-GMAC
(Remove, Hash, Cipher, Output, and Context are the Destination columns.)

| Instruction | Source of Data | Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|
| RETRIEVE (hash and store SPI) | input | - | ESP SPI | - | - | ESP SPI (SPI offset) |
| RETRIEVE (a) (store seq. num. low) | input | - | - | - | - | seq. num. low |
| INSERT (Estimated Seq. num. high) | context | - | Seq. num. high | - | - | - |
| INSERT (Seq. num. low) | context | - | Seq. num. low | - | - | - |

a. This instruction will trigger estimation of upper sequence number.
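For reference, the estimation of the upper sequence number can be modelled on the host with the procedure from RFC 4303, Appendix A. Whether the packet processor uses exactly this procedure is an assumption; the function and parameter names below are hypothetical.

```python
MOD32 = 1 << 32


def estimate_seq_high(seq_low: int, top_high: int, top_low: int, window: int = 64) -> int:
    """Estimate the high 32 bits of an ESN from the received low 32 bits.

    top_high/top_low: highest authenticated 64-bit sequence number, split in halves.
    window: anti-replay window size in packets.
    """
    # Low 32 bits of the bottom edge of the window (wraps modulo 2^32).
    lower = (top_low - window + 1) % MOD32
    if top_low >= window - 1:
        # Case A: the window lies within a single 2^32 sequence subspace.
        return top_high if seq_low >= lower else (top_high + 1) % MOD32
    # Case B: the window spans two adjacent subspaces.
    return (top_high - 1) % MOD32 if seq_low >= lower else top_high


print(estimate_seq_high(90, 0, 100))
print(estimate_seq_high(10, 0, 0xFFFFFFF0))
```

The estimate is only a candidate: it is confirmed when the ICV computed with the estimated high word verifies, which is why the estimated value is inserted into the hash stream ahead of the low sequence number.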
AH Outbound Flow
The AH outbound processing is executed according to Table C-21.
Table C-21. AH Outbound with AES-GMAC
(Remove, Hash, Cipher, Output, and Context are the Destination columns.)

| Instruction | Source of Data | Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|
| REMOVE (Ethernet header) | input | Ethernet header | - | - | - | - |
| INSERT (Muted IP header) | token | - | Muted IP header | - | - | - |
| DIRECTION (non-muted IP header) | input | - | - | - | IP header | - |
| INSERT (1st word of AH header) | token | - | AH header 1st word | - | AH header 1st word | - |
| INSERT (SPI, Seq. num) | context | - | SPI, Seq. num | - | SPI, Seq. num | - |
| INSERT (IV) | context (IV1 offset) | - | IV | - | IV | - |
| REMOVE_RESULT (schedule removal of Y0 from the output) | output buffer | - | - | - | - | - |
| INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | - |
| INSERT (Zeroes as ICV field) | instruction | - | Zero ICV | - | Zero ICV | - |
| INSERT (a) (zeroes to pad AH header) | instruction | - | Zero padding | - | Zero padding | - |
| DIRECTION (Payload) | input | - | Payload | - | Payload | - |
| INSERT (b) (seq. num. high) | context | - | seq. num. high | - | - | - |
| Execution of REMOVE_RESULT instruction | output buffer | Y0 | - | - | - | - |
| REPLACE (c) (replace zero field with result ICV) | context | - | - | - | ICV | - |
| CONTEXT_ACCESS (update sequence number) | context | - | - | - | - | sequence number |

a. In case of IPv6, the length of the AH header should be a multiple of 64 bits, hence additional zero padding is necessary. This instruction can be combined with the previous instruction (insertion of zeroes to ICV place).
b. Hash higher sequence number in case of extended sequence numbering.
c. When processing big packets, the result ICV is appended to the output packet so that the host processor can perform a replace operation.
AH Inbound Flow
The AH inbound processing is executed according to Table C-22.
Table C-22. AH Inbound with AES-GMAC
(Remove, Hash, Cipher, Output, and Context are the Destination columns.)

| Instruction | Source of Data | Remove | Hash | Cipher | Output | Context |
|---|---|---|---|---|---|---|
| REMOVE (Ethernet header) | input | Ethernet header | - | - | - | - |
| INSERT (Muted IP header) | token | - | Muted IP header | - | - | - |
| DIRECTION (IP header) | input | - | - | - | IP header | - |
| RETRIEVE (1st word of AH header) | input | - | AH header 1st word | - | - | AH header 1st word |
| RETRIEVE (SPI, Seq. num) | input | - | SPI, Seq. num | - | - | SPI, Seq. num |
| RETRIEVE (IV) | input | - | IV | - | - | IV (IV1 offset) |
| REMOVE_RESULT (schedule removal of Y0 from the output) | output buffer | - | - | - | - | - |
| INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | - |
| RETRIEVE (Store ICV field in context) | input | - | ICV | - | - | ICV offset |
| DIRECTION (a) (zeroes to pad AH header) | input | - | Zero padding | - | - | - |
| INSERT (Zeroes as ICV field) | instruction | - | Zero ICV | - | - | - |
| DIRECTION (Payload – last AAD data) | input | - | Payload | - | Payload | - |
| INSERT (b) (seq. num. high) | context | - | seq. num. high | - | - | - |
| Execution of REMOVE_RESULT instruction | output buffer | Y0 | - | - | - | - |
| VERIFY_FIELDS (ICV, SPI, Seq.num) | context | - | - | - | - | - |
| CONTEXT_ACCESS (update sequence number and mask) | context | - | - | - | - | seq. num and mask |

a. In case of IPv6, the length of the AH header should be a multiple of 64 bits, hence additional zero padding is necessary for the ICV calculation and it is part of the input packet.
b. Hash high sequence number when extended sequence numbering is used.
C.3.5 SRTP/SRTCP Protocols
C.3.5.1 Introduction
The packet processor supports basic acceleration of SRTP/SRTCP protocols. Basic acceleration
does not support full header processing. Therefore, the header should be generated by the host
processor and included in the packet.
The summary of supported features is listed in Table C-23.
Table C-23. SRTP/SRTCP Functionality

| Functionality | Inbound | Outbound |
|---|---|---|
| IP header | Modification | Modification |
| UDP header | Bypass | Bypass |
| SRTP/SRTCP Header processing | Bypass | Bypass |
| IV processing | From context | From context |
| MKI field (can be optional) | Removal, Verification | Insertion (from SPI) |
| SRTP ROC | From Token | From Token |
| SRTCP E+Index | Removal | Insertion (from Token) |
| TAG (variable length) | Verification | Insertion |
| Cipher algorithm | Null-crypto, AES-ICM | Null-crypto, AES-ICM |
| Hash Algorithm | HMAC SHA1 | HMAC SHA1 |
C.3.5.2 Packet Format
The SRTP packet format is shown in Table C-24.
Table C-24. SRTP Packet Format
SRTP Packet (each row is one 32-bit word or a variable-length field, bits 0 to 31 from left to right)
V=2 | P | X | CC | M | PT | Sequence Number
Timestamp
Synchronization Source (SSRC) Identifier
Contribution Source (CSRC) Identifier
RTP Extension (Optional)
Payload
RTP Padding | RTP Pad Count
SRTP MKI (Optional)
Authentication Tag (Recommended)
The Master Key Information (MKI) field is used by key management. The MKI field identifies the
master key from which the session key(s) were derived that authenticate and/or encrypt the particular packet. Note that the MKI must not identify the SRTP cryptographic context. The MKI can
be used by key management for the purposes of re-keying, identifying a particular master key
within the cryptographic context.
The TAG field is used to carry message authentication data. The Authenticated Portion of an SRTP
packet consists of the RTP header followed by the encrypted portion of the SRTP packet. Thus, if
both encryption and authentication are applied, encryption must be applied before authentication
on the sender side and conversely on the receiver side. The authentication tag provides authentication of the RTP header and payload, and it indirectly provides replay protection by
authenticating the sequence number. Note that the MKI is not integrity protected as this does not
provide any extra protection.
The SRTCP packet format is shown in Figure C-7.
SRTCP Packet (each row is one 32-bit word or a variable-length field, bits 0 to 31 from left to right)
V=2 | P | RC | PT=SR or RR | Length
Sender SSRC
Sender info
Report block 1
Report block 2
…
V=2 | P | SC | PT=SDES=202 | Length
SSRC / CSRC_1
SDES Items
…
E | SRTCP Index
SRTCP MKI (Optional)
Authentication tag
Figure C-7: SRTCP Packet Format
The bit E is set when the current SRTCP packet is encrypted.
The SRTCP index is a 31-bit counter for the SRTCP packet. The index is explicitly included in each
packet, in contrast to the “implicit” index approach used for SRTP. The counter must be cleared to
zero before the first SRTCP packet is sent, and must be incremented by one, modulo 2^31, after
each SRTCP packet is sent. In particular, after a re-key, this field must not be reset to zero again.
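The E bit and the 31-bit index share one 32-bit word, so the host can build and advance that word as sketched below. The sketch assumes that E is the most significant bit of the word in network byte order, as drawn in Figure C-7; the function names are hypothetical.

```python
INDEX_MOD = 1 << 31  # the SRTCP index is a 31-bit counter


def pack_e_index(encrypted: bool, index: int) -> bytes:
    """Return the 4-byte E + SRTCP index word in network byte order."""
    if not 0 <= index < INDEX_MOD:
        raise ValueError("SRTCP index must fit in 31 bits")
    return ((int(encrypted) << 31) | index).to_bytes(4, "big")


def unpack_e_index(word: bytes):
    """Split a 4-byte word into (E flag, SRTCP index)."""
    value = int.from_bytes(word, "big")
    return bool(value >> 31), value & (INDEX_MOD - 1)


def next_index(index: int) -> int:
    """Advance the index by one, modulo 2^31, after a packet is sent."""
    return (index + 1) % INDEX_MOD


word = pack_e_index(True, 1)
print(word.hex(), unpack_e_index(word), next_index(INDEX_MOD - 1))
```

Keeping the counter in host state (rather than resetting it on re-key) matches the requirement above that the index is cleared only before the first SRTCP packet.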
C.4 Context Control Words
For SRTP/SRTCP processing the context control words must be configured correctly. The layout
and allowable settings of the control words are shown in the figures that follow.
Context – Control Word 0
[Figure: Context Control Word 0 bit layout, bits 31 to 00. Fields shown: ToP, packet-based options, key, crypto algorithm, digest type, hash algorithm, SEQ, MASK0, MASK1, SPI, context length, and reserved bits.]
The applicable fields are listed in Table C-25.
Table C-25. SRTP/SRTCP Context Control Word 0

| Field | Value | Description |
|---|---|---|
| hash algorithm | 010 | SHA1 HMAC. |
| digest type | 11 | HMAC type of hash algorithm. |
| crypto algorithm | 101 | AES-128. |
| key | 1 | The Key is used in processing for cipher algorithms. |
| | 0 | The Key is not used for Null-Crypto Mode. |
| context length | * | See description in “context length” on page 437. |
| packet based options | 0000 | Default value. |
| ToP (Type of Packet) | 0010 | Outbound hash operation (for Null-Crypto Mode). |
| | 0110 | Outbound encrypt-then-hash operation (for all other cipher algorithms). |
| | 0011 | Inbound hash operation (for Null-Crypto Mode). |
| | 1111 | Inbound hash-then-decrypt operation (for all other cipher algorithms). |
Context – Control Word 1
[Figure: Context Control Word 1 bit layout, bits 31 to 00. Fields shown: disable mask upd., seq. nbr. store, state selection, i-j-pntr, hash store, enc. hash result, pad type, digest cnt, crypto mode, Feedback, IV0, IV1, IV2, IV3, IV format, crypto-store, address mode, and reserved bits.]
The applicable fields are listed in Table C-26.
Table C-26. SRTP/SRTCP Context Control Word 1

| Field | Value | Description |
|---|---|---|
| hash store | 0 | Do not store result digest into internal context register. |
| pad type | 000 | Not used. |
| crypto store | 0 | Do not store result IV (IV is unique for every packet). |
| IV format | 00 | Use full IV mode for IV processing. |
| digest Cnt. | 0 | Digest counter is not used. |
| IV3..IV0 | 1111 | 16-byte IV (AES-ICM). |
| feedback mode | 00 | For AES. |
| crypto mode | 011 | ICM with 16-bit counter initialized to zero. The 16-bit counter rollover should be detected by the host. |
C.4.0.1 Outbound Processing
The updated UDP and RTP headers are provided as part of the input packet. The MKI field is
taken from the SPI field of the context record.
The outbound SRTP processing is executed according to Table C-27.
Table C-27. SRTP Outbound Processing
Destination columns: Remove, Hash, Cipher, Output, Context.
N | Instruction | Source of Data | Remove | Hash | Cipher | Output | Context
1 | IPV4_CKS or IPV6 (modify IP header) | input | - | - | - | IP header | -
2 | DIRECTION (IP address) | input | - | - | - | IP address | -
3 | DIRECTION (UDP header) | input | - | - | - | UDP header | -
4 | DIRECTION (RTP header) | input | - | RTP header | - | RTP header | -
5 | DIRECTION (Payload data) | input | - | Payload | Payload | Payload | -
6 | INSERT (hash ROC value) | token | - | ROC | - | - | -
7 | INSERT (optional MKI) | context (SPI field) | - | - | - | MKI | -
8 | INSERT (TAG field) | context | - | - | - | TAG | -
The outbound processing for SRTP functions as follows:
1.
The IPv4 or IPv6 header is updated with parameters of result packet (result length).
2.
The IP address is passed to the output.
3.
The UDP header is passed to the output.
4.
The RTP header is hashed and passed to the output.
5.
Payload data are encrypted and then hashed.
6.
The ROC value is inserted from the input token into the hash stream.
7.
The MKI field is inserted from the SPI field of context record.
8.
Result TAG is appended to the end of the packet.
The outbound SRTCP processing is executed according to Table C-28.
Table C-28. SRTCP Outbound Processing
Destination columns: Remove, Hash, Cipher, Output, Context.
N | Instruction | Source of Data | Remove | Hash | Cipher | Output | Context
1 | IPV4_CKS or IPV6 (modify IP header) | input | - | - | - | IP header | -
2 | DIRECTION (IP address) | input | - | - | - | IP address | -
3 | DIRECTION (UDP header) | input | - | - | - | UDP header | -
4 | DIRECTION (RTCP header) | input | - | SRTCP header | - | SRTCP header | -
5 | DIRECTION (Payload data) | input | - | Payload | Payload | Payload | -
6 | INSERT (E + Index value) | token | - | E + Index | - | E + Index | -
7 | INSERT (optional MKI) | context (SPI field) | - | - | - | MKI | -
8 | INSERT (TAG field) | context | - | - | - | TAG | -
The outbound processing for SRTCP functions as follows:
1.
The IPv4 or IPv6 header is updated with parameters of result packet (result length).
2.
The IP address is passed to the output.
3.
The UDP header is passed to the output.
4.
The SRTCP header is hashed and passed to the output.
5.
Payload data are encrypted and then hashed.
6.
The E bit and Index value are inserted from the input token into the hash stream and to the
output.
7.
The MKI field is inserted from the SPI field of the context record.
8.
Result TAG is appended to the end of the packet.
C.4.0.2 Inbound Processing
The UDP and RTP headers are not modified by the packet processor during the inbound transform.
The inbound SRTP processing is executed according to Table C-29.
Table C-29. SRTP Inbound Processing
Destination columns: Remove, Hash, Cipher, Output, Context.
N | Instruction | Source of Data | Remove | Hash | Cipher | Output | Context
1 | IPV4_CKS or IPV6 (modify IP header) | input | - | - | - | IP header | -
2 | DIRECTION (IP address) | input | - | - | - | IP address | -
3 | DIRECTION (UDP header) | input | - | - | - | UDP header | -
4 | DIRECTION (RTP header) | input | - | RTP header | - | RTP header | -
5 | DIRECTION (Payload data) | input | - | Payload | Payload | Payload | -
6 | INSERT (hash ROC value) | token | - | ROC | - | - | -
7 | RETRIEVE (optional MKI) | input (SPI result) | - | - | - | - | MKI
8 | RETRIEVE (store TAG in the context) | input | - | - | - | - | TAG
9 | VERIFY (calculated TAG) | context (hash result) | - | - | - | - | TAG
The inbound processing for SRTP functions as follows:
1. The IPv4 or IPv6 header is updated with parameters of result packet (result length).
2. The IP address is passed to the output.
3. The UDP header is passed to the output.
4. The RTP header is hashed and passed to the output.
5. Payload data are hashed and at the same time decrypted. Decrypted payload is sent to the output.
6. The ROC value is inserted from the input token into the hash stream.
7. The MKI is retrieved from the input and stored in the SPI field of the context record.
8. The TAG is retrieved from the input and stored in the result digest field.
9. Calculated TAG is compared with the retrieved TAG in the context.
The inbound SRTCP processing is executed according to Table C-30.
Table C-30. SRTCP Inbound Processing
Destination columns: Remove, Hash, Cipher, Output, Context.
N | Instruction | Source of Data | Remove | Hash | Cipher | Output | Context
1 | IPV4_CKS or IPV6 (modify IP header) | input | - | - | - | IP header | -
2 | DIRECTION (IP address) | input | - | - | - | IP address | -
3 | DIRECTION (UDP header) | input | - | - | - | UDP header | -
4 | DIRECTION (RTP header) | input | - | RTP header | - | RTP header | -
5 | DIRECTION (Payload data) | input | - | Payload | Payload | Payload | -
6 | INSERT (hash E + Index value) | token | - | E + Index | - | - | -
7 | RETRIEVE (optional MKI) | input (SPI result) | - | - | - | - | MKI
8 | RETRIEVE (store TAG in the context) | input | - | - | - | - | TAG
9 | VERIFY (calculated TAG) | context (hash result) | - | - | - | - | TAG
The inbound processing for SRTCP functions as follows:
1.
The IPv4 or IPv6 header is updated with parameters of result packet (result length).
2.
The IP address is passed to the output.
3.
The UDP header is passed to the output.
4.
The SRTCP header is hashed and passed to the output.
5.
Payload data are hashed and at the same time decrypted. Decrypted payload is sent to the
output.
6.
The E bit and Index are inserted to the hash stream and removed from the packet.
7.
The MKI is retrieved from the input and stored in SPI field of the context record.
8.
The TAG is retrieved from the input and stored in the result digest field.
9.
Calculated TAG is compared with the retrieved TAG in the context.
C.4.1 MACsec Protocol
C.4.1.1 Introduction
The packet processor supports inbound and outbound packet processing for MACsec. Support of
MACsec is implemented according to Table C-31.
Table C-31. MACsec Functionality
Functionality | Inbound | Outbound
Header processing | Removal | Insertion: STI from token, PN and SCI from context
IV processing | From input header (with SCI); from input header and context (without SCI) | From Context
Packet number | Verification | Generation. Overflow check.
ICV (16-byte) | Verification | Insertion
Confidentiality offset | Supported | Supported
Cipher suites | Integrity and confidentiality (AES-GCM); integrity only (AES-GMAC) | Integrity and confidentiality (AES-GCM); integrity only (AES-GMAC)
C.4.1.2 Packet Format
The format of MACsec packet is shown in Figure C-8.
[Figure C-8 shows the MAC Protected Data Unit (MPDU): macDA (6 bytes), macSA (6 bytes), SecTAG (16 bytes, or 8 bytes without the optional SCI), Secure/User data (MSDU), and ICV (16 bytes only).
The SecTAG consists of the 4-byte SecTAG Information (STI), the 4-byte Packet Number (PN), and the optional 8-byte Secure Channel Identifier (SCI).
The STI consists of the 2-byte MACsec EtherType (ET, 88-E5), the 1-byte TAG Control Information (TCI) with the 2-bit Association Number (AN), and the 1-byte SL field (<48: data length; 0: length is provided by LMI).
The TCI bits, 1 bit each, are: V (version number, 0), ES (end station), SC (SCI included in SecTAG), SCB (Single Copy Broadcast, EPON), E (encryption), and C (changed text).]
Figure C-8: MACsec Packet Format
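The field sizes in Figure C-8 can be illustrated with a short host-side parser. This is a minimal sketch, not part of the packet processor API: the function name `parse_sectag` and the returned dictionary keys are hypothetical, and the TCI bit positions are assumed to follow IEEE 802.1AE (V in the most significant bit, AN in the two least significant bits) because the figure itself only lists the bit names.

```python
import struct

MACSEC_ETHERTYPE = 0x88E5  # MACsec EtherType (88-E5) from Figure C-8


def parse_sectag(frame):
    """Split a MACsec frame into addresses, SecTAG fields, and the rest."""
    if len(frame) < 20:
        raise ValueError("too short for macDA, macSA, and an 8-byte SecTAG")
    mac_da, mac_sa = frame[0:6], frame[6:12]
    # STI = 2-byte EtherType + 1-byte TCI/AN + 1-byte SL, then the 4-byte PN.
    ethertype, tci_an, sl, pn = struct.unpack("!HBBI", frame[12:20])
    if ethertype != MACSEC_ETHERTYPE:
        raise ValueError("not a MACsec EtherType")
    # Assumed bit positions (IEEE 802.1AE ordering, not stated in this guide).
    tci = {
        "V": bool(tci_an & 0x80),
        "ES": bool(tci_an & 0x40),
        "SC": bool(tci_an & 0x20),
        "SCB": bool(tci_an & 0x10),
        "E": bool(tci_an & 0x08),
        "C": bool(tci_an & 0x04),
    }
    offset, sci = 20, None
    if tci["SC"]:
        # The optional 8-byte SCI extends the SecTAG from 8 to 16 bytes.
        if len(frame) < 28:
            raise ValueError("SC bit set but SCI is missing")
        sci, offset = frame[20:28], 28
    return {
        "mac_da": mac_da, "mac_sa": mac_sa, "tci": tci,
        "an": tci_an & 0x03, "sl": sl, "pn": pn, "sci": sci,
        "rest": frame[offset:],  # secure/user data followed by the ICV
    }
```

A host could use such a helper to reject frames with a packet number of zero before submitting them for inbound processing.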
Protection of MACsec packet is shown in Figure C-9.
[Figure C-9 shows the frame macDA, macSA, SecTAG, Payload, ICV with two protection ranges marked: Integrity, and Confidentiality, which starts after the Conf offset.]
Figure C-9: MACsec Protection
The way data are processed by the crypto engine is shown in Figure C-10.
[Figure C-10 shows the input frame (macDA, macSA, SecTAG, Secure/User data, ICV), the output frame, and the GCM-AES-128 block between them, with the Integrity and Confidentiality ranges, the Conf offset, and the AAD length marked.
GCM-AES-128 inputs: K (secret key, 128 b) = SAK, K[127:0]; IV (Initialization Vector, 96 b) = SCI in IV[95:32] and PN in IV[31:0]; A (additional authentication data); P (plain text).
GCM-AES-128 outputs: C (cipher data) and T (128 b).
Data are routed through the cipher when (Byte count > AAD length) AND NOT(integrity only).
In the outbound flow T becomes the ICV; in the inbound flow T is compared (=) with the inbound frame ICV to produce VALID.]
Note: SecTAG has reversed field order
Figure C-10: AES-GCM/GMAC-128 Data Flow
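The IV composition labelled in Figure C-10 (SCI in IV[95:32], PN in IV[31:0]) can be sketched as follows. The function name `gcm_iv` is hypothetical, and serializing the 96-bit value most significant byte first is an assumption; the figure notes that the SecTAG has reversed field order, so the byte order expected in the context record may differ.

```python
def gcm_iv(sci, pn):
    """Build the 96-bit AES-GCM IV: SCI in bits [95:32], PN in bits [31:0]."""
    if not 0 <= sci < (1 << 64):
        raise ValueError("SCI must fit in 64 bits")
    if not 0 <= pn < (1 << 32):
        raise ValueError("PN must fit in 32 bits")
    iv = (sci << 32) | pn
    # Assumption: big-endian serialization of the 96-bit value.
    return iv.to_bytes(12, "big")
```

Because the PN occupies the low 32 bits, every packet on a secure channel gets a distinct IV as long as the packet number does not wrap.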
C.4.1.3 Context Control Words
MACsec processing requires the control words to be configured correctly. The layout and allowable settings of the control words are shown in the figures below.
Context – Control Word 0
[Bit-field layout, bits 31 to 00. Fields: MASK1, MASK0, SPI, SEQ, hash algorithm, digest type, crypto algorithm, key, ToP, packet-based options, context length; remaining bits are reserved.]
The applicable fields are listed in Table C-32.
Table C-32. MACsec Context Control Word 0
Field | Value | Description
MASK1, MASK0 | 00 | Outbound processing does not use mask.
MASK1, MASK0 | 01 | Inbound processing with 64-bit mask.
MASK1, MASK0 | 10 | Inbound processing with 32-bit mask.
MASK1, MASK0 | 11 | Inbound processing with 128-bit mask.
SEQ | 01 | Use 32-bit packet number.
SPI | 0 | SPI field is not used in processing.
hash algorithm | 100 | GHASH.
digest type | 10 | Use dedicated hash algorithm (GHASH).
crypto algorithm | 101 | AES-128.
key | 1 | The Key is used in processing.
context length | * | See description in “context length” on page 437.
packet based options | 0000 | Default value.
ToP (Type of Packet) | 0110 | Encrypt-then-hash for outbound operation (AES-GCM).
ToP (Type of Packet) | 1110 | Hash-then-encrypt for outbound operation (AES-GMAC).
ToP (Type of Packet) | 1111 | Hash-then-decrypt for inbound operation (AES-GCM/GMAC).
Context – Control Word 1
[Bit-field layout, bits 31 to 00. Fields: address mode, crypto-store, IV format, IV3, IV2, IV1, IV0, Feedback, crypto mode, digest cnt, pad type, enc. hash result, hash store, i-j-pntr, state selection, seq. nbr. store, disable mask upd.; remaining bits are reserved.]
The applicable fields are listed in Table C-33.
Table C-33. MACsec Context Control Word 1
Field | Value | Description
disable mask update | 0 | For outbound and for inbound ‘out of order’ mode.
disable mask update | 1 | For inbound, mask update is disabled (sliding window only).
hash store | 0 | Do not store result digest into internal context register.
enc hash result | 1 | Use encrypted hash result.
pad type | 000 | Padding is not used.
crypto store | 0 | Do not store result IV.
IV format | 00 | Full IV mode.
IV3...IV0 | 0111 | 12-byte IV (AES).
crypto mode | 010 | AES-CTR with counter initialized to 1.
C.4.1.4 Outbound Processing
Input Data (Outbound)
In order to process outbound MACsec packets, the following data must be provided to the packet
processor:
•
12-byte MAC address
•
16-byte SecTAG
•
4-byte SecTAG information (STI) – composed of EtherType, TCI/AN, SL fields
•
4-byte packet number
•
8-byte Secure Channel Identifier (SCI)
•
User Data
•
Confidentiality offset
•
16-byte AES-GCM secret key (SAK)
•
16-byte AES-GCM hash key
The STI field is inserted from the token. The layout of the inserted word is shown in Figure C-11.
STI Field from Token
[32-bit word layout, bits 31 to 00. Fields: MACsec EtherType Byte 2, MACsec EtherType Byte 1, AN, C, E, SCB, SC, ES, V, SL (optional).]
Figure C-11: MACsec SPI Field Layout
The 4-byte packet number is inserted from the incremented value of the context sequence number
field. At the beginning of the session, the sequence number field in the context must be set
to 0. This allows inserting values from 1 to 2^32 inclusive. When the increment operation results in
an overflow of the sequence number, the sequence number overflow error (E10) is generated (see
Table D-8, “Error Codes,” on page 430).
The SCI field, when present, is taken from the token.
Data Flow
The MACsec outbound processing with AES-GCM is executed according to Table C-34.
Table C-34. MACsec Outbound with AES-GCM
Destination columns: Remove, Hash, Cipher, Output, Context.
N | Instruction | Source of Data | Remove | Hash | Cipher | Output | Context
1 | DIRECTION (pass MAC address to the output) | input | - | MAC address | - | MAC address | -
2 | INSERT (insert STI field) | token | - | STI | - | STI | -
3 | INSERT_CTX (insert packet number) | context (seq_num_res) | - | PN | - | PN | (IV2 offset)
4 | INSERT_CTX (insert SCI) (a) | token | - | SCI | - | SCI | (IV0 offset)
5 | REMOVE_RESULT (schedule removal of Y0 from the output) | output buffer | - | - | - | - | -
6 | INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | -
7 | DIRECTION (user data – integrity only) | input | - | User Data Int. only | - | User Data Int. only | -
8 | DIRECTION (user data – confidentiality and integrity) | input | - | User Data conf. and int. | User Data conf. and int. | User Data conf. and int. | -
9 | Execution of REMOVE_RESULT instruction | output buffer | Y0 | - | - | - | -
10 | INSERT (ICV field) | context | - | - | - | ICV | -
11 | CONTEXT_ACCESS (write updated packet number) | context (seq. num.) | - | - | - | - | Packet Number
a. This instruction is necessary when SCI field is present in the MACsec packet.
The outbound processing functions as follows:
1. The MAC address is passed directly to the output stream.
2. The STI field is inserted from the token.
3. The packet number (incremented value) is inserted from the context and also used as part of the IV.
4. When the SCI field is present in the stream, it is inserted from the input token into the output stream, hashed, and used as part of the IV.
5. An instruction is scheduled to remove Y0 from the output buffer.
6. A block of zeros is inserted in the cipher stream to create the encrypted Y0.
7. The payload data that are protected only by integrity are hashed and passed to the output.
8. The payload data that are protected by confidentiality and integrity are encrypted, hashed, and then passed to the output stream.
9. The encrypted Y0 block is removed from the output buffer.
10. The result ICV is appended to the output stream.
11. The incremented packet number is written back to the context memory.
C.4.1.5 Inbound Processing
Input Data (Inbound)
In order to process an inbound MACsec packet, the following data must be provided to the packet
processor:
•
Inbound MACsec packet
•
Confidentiality offset
•
16-byte AES-GCM secret key
•
16-byte AES-GCM hash key
The verification of packet number is done as follows:
The packet processor supports out-of-order packet numbers within the specified window. This
functionality is implemented using the IPSec replay protection logic, except that the sequence mask is
not updated. By programming the sequence number mask field, it is possible to specify any window from 1 to 128. The four mask fields of the context should be programmed with a number
based on this formula:
Mask[N-1:0] = {N{1'b1}} << window_size
Note: According to the [MACsec] standard, receive packets with packet number equal to zero are
not allowed. This check must be done by the host before providing packets to the packet
processor.
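The mask formula above is written in Verilog notation: N ones, shifted left by the window size and truncated to N bits. The following minimal sketch evaluates it on the host. The function names `window_mask` and `mask_words` are hypothetical, and the split into four 32-bit words with the least significant word first is an assumption about how the four mask fields are ordered in the context record.

```python
def window_mask(n_bits, window_size):
    """Evaluate Mask[N-1:0] = {N{1'b1}} << window_size, truncated to N bits."""
    if not 0 <= window_size <= n_bits:
        raise ValueError("window size must be between 0 and N")
    all_ones = (1 << n_bits) - 1
    return (all_ones << window_size) & all_ones


def mask_words(mask, n_words=4):
    """Split a mask into 32-bit words, least significant word first (assumed)."""
    return [(mask >> (32 * i)) & 0xFFFFFFFF for i in range(n_words)]
```

For example, a 64-packet window on a 128-bit mask clears the low 64 bits and leaves the high 64 bits set, so `mask_words(window_mask(128, 64))` yields two zero words followed by two all-ones words.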
Data Flow
The MACsec inbound processing with AES-GCM is executed according to Table C-35.
Table C-35. MACsec Inbound with AES-GCM
Destination columns: Remove, Hash, Cipher, Output, Context.
N | Instruction | Source of Data | Remove | Hash | Cipher | Output | Context
1 | DIRECTION (pass MAC address to the output) | input | - | MAC address | - | MAC address | -
2 | DIRECTION (hash and remove STI field) | input | - | STI | - | - | -
3 | RETRIEVE (retrieve and hash packet number) | input | - | PN | - | - | PN (result seq.num.)
4 | RETRIEVE (retrieve SCI) (a) | input | - | SCI | - | - | (IV0 offset)
5 | INSERT_CTX (use packet number as part of IV) | context (result seq.num) | - | - | - | - | (IV2 offset)
6 | REMOVE_RESULT (schedule removal of Y0 from the output) | - | - | - | - | - | -
7 | INSERT (zeroes for Y0 generation) | instruction | - | - | Block of zeroes | Block of zeroes | -
8 | DIRECTION (user data – integrity only) | input | - | User Data Int. only | - | User Data Int. only | -
9 | DIRECTION (user data – confidentiality and integrity) | input | - | User Data conf. and int. | User Data conf. and int. | User Data conf. and int. | -
10 | Execution of REMOVE_RESULT instruction | output buffer | Y0 | - | - | - | -
11 | RETRIEVE (store ICV field in the context) | input | - | - | - | - | hash result offset
12 | VERIFY (calculated ICV and PN) | context | - | - | - | - | -
13 | CONTEXT_ACCESS (update packet number) (a) | context | - | - | - | - | packet number
a. This instruction is necessary when SCI field is present in the MACsec packet.
The inbound processing functions as follows:
1.
The MAC address is passed directly to the output stream.
2.
The STI field is hashed and removed from the packet.
3.
The packet number is retrieved from the input, hashed and stored in context for later check.
4.
When the SCI field is present in the stream, it is retrieved from the input, hashed,
and used as part of the IV. Otherwise this instruction should be a NOP instruction, because the
context register is read immediately after writing.
5.
Packet number is copied to IV0 field of the context.
6.
Schedule an instruction to remove the Y0 from the output buffer.
7.
A block of zeros is inserted in the cipher stream to create encrypted Y0.
8.
The payload data, which is protected only by integrity are hashed and passed to the output.
9.
The payload data, which are protected by confidentiality and integrity are decrypted, hashed
and then passed to the output stream.
10. The encrypted Y0 block is removed from the output buffer.
11. The packet ICV is retrieved from the input and stored in context for comparison.
12. The calculated ICV is compared with ICV from the packet. The packet number is checked
according to Table D-41 on page 488.
13. The updated packet number is written back to the context memory.
Inbound Checks
The MACsec processing token can contain instructions for performing a number of inbound
checks. The available checks are:
•
Sequence number check – out of the window check without replay protection. An out of window packet failure will cause a Sequence number failure error.
•
ICV check – the calculated ICV value during inbound processing is compared with the ICV
value received from the input stream. In case of a mismatch, the authentication failure
error is generated.
C.4.2 DTLS Protocol
C.4.2.1 Introduction
The packet processor supports DTLS protocol without length field processing. This is because the
packet processor is designed as a stream processor and does not have the ability to “look to the
end of the packet” to decrypt the last two words.
Therefore, in the case of block ciphers, and to allow single-pass processing, the external host must
have corrected the length field before submitting the packet to the packet processor.
C.4.2.2 Supported Features
Support of DTLS processing is implemented according to Table C-36.
Table C-36. DTLS Functionality
Functionality | Inbound | Outbound
Header | Removal. Checking type and version, epoch | Insertion
Length field processing | By the host | By the host
IV processing | From input | Insertion from context
Sequence number | Verification with 64-bit or 128-bit mask | Generation. Overflow check.
Fragment compression/decompression | Null | Null
MAC (a) | Verification | Insertion
Crypto padding | Removal and verification | Insertion
Cipher algorithms | Null-crypto; DES, 3DES; AES-CBC (128, 192, 256-bit key) | Null-crypto; DES, 3DES; AES-CBC (128, 192, 256-bit key)
Hash | HMAC-MD5; HMAC-SHA1 (optional SHA2) | HMAC-MD5; HMAC-SHA1 (optional SHA2)
a. Message Authentication Code (MAC).
C.4.2.3 Packet Format
The format of DTLS packet is shown on Figure C-12.
[Figure C-12 shows the DTLS record fields in order:
Type (1 byte): 20 = change_cipher_spec, 21 = alert, 22 = handshake, 23 = application_data.
Version (2 bytes): {254, 255} = DTLS 1.0.
Epoch (2 bytes): cipher state change counter value.
Sequence Number (6 bytes): record sequence number.
Length of fragment (2 bytes): <= 2^14 for a plaintext fragment, <= 2^14 + 1024 for a compressed fragment, <= 2^14 + 2048 for an encrypted fragment.
Fragment: IV (block size), Payload, MAC (md5: 16 bytes, sha: 20 bytes), Padding (0 <= L(pad) <= 255, value = L(pad)), and L(pad) (1 byte).]
Figure C-12: DTLS Packet Format
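The 13-byte record header in Figure C-12 can be split on the host before a packet is submitted. The sketch below is illustrative only: `parse_dtls_header` and its dictionary keys are hypothetical names, and network byte order is assumed for the multi-byte fields.

```python
import struct

DTLS_HEADER_LEN = 13  # 1 (type) + 2 (version) + 2 (epoch) + 6 (seq) + 2 (length)


def parse_dtls_header(record):
    """Split a DTLS record into header fields and the fragment."""
    if len(record) < DTLS_HEADER_LEN:
        raise ValueError("record shorter than the 13-byte DTLS header")
    content_type, major, minor, epoch = struct.unpack("!BBBH", record[:5])
    seq_num = int.from_bytes(record[5:11], "big")  # 48-bit sequence number
    (frag_len,) = struct.unpack("!H", record[11:13])
    fragment = record[13:13 + frag_len]
    if len(fragment) != frag_len:
        raise ValueError("fragment is shorter than the length field")
    return {
        "type": content_type, "version": (major, minor),
        "epoch": epoch, "seq_num": seq_num,
        "length": frag_len, "fragment": fragment,
    }
```

The `length` value returned here is the fragment length that the host has to handle itself, since the packet processor does not process the length field.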
C.4.2.4 Context Control Words
DTLS processing requires the context control words to be configured correctly. The layout and
allowable settings of the control words are shown in the figures that follow.
Context – Control Word 0
[Bit-field layout, bits 31 to 00. Fields: MASK1, MASK0, SPI, SEQ, hash algorithm, digest type, crypto algorithm, key, ToP, packet-based options, context length; remaining bits are reserved.]
The applicable fields are listed in Table C-37.
Table C-37. DTLS Context Control Word 0
Field | Value | Description
MASK1, MASK0 | 00 | Outbound processing does not use mask.
MASK1, MASK0 | 01 | Inbound 64-bit mask.
MASK1, MASK0 | 11 | Inbound 128-bit mask.
SEQ | 10 | Use 48-bit sequence number.
SPI | 1 | SPI value is used in processing.
Hash Algorithm | 000 | MD5 HMAC.
Hash Algorithm | 010 | SHA1 HMAC.
Digest Type | 11 | HMAC type of hash algorithm is used.
Crypto Algorithm | * | Select applicable crypto algorithm (see Table D-10, “Control Word 0 Field Encoding,” on page 437).
Key | 1 | The Key is used in processing for cipher algorithms.
Key | 0 | The Key is not used for Null-Crypto Mode.
Context Length | * | See description in “context length” on page 437.
Packet Based Options | 0000 | Default value.
ToP (Type of Packet) | 0010 | Outbound hash operation (for Null-Crypto Mode).
ToP (Type of Packet) | 1110 | Outbound hash-encrypt operation (for all other cipher algorithms).
ToP (Type of Packet) | 0011 | Inbound hash operation (for Null-Crypto Mode).
ToP (Type of Packet) | 0111 | Inbound decrypt-hash operation (for all other cipher algorithms).
Context – Control Word 1
[Bit-field layout, bits 31 to 00. Fields: address mode, crypto-store, IV format, IV3, IV2, IV1, IV0, Feedback, crypto mode, digest cnt, pad type, enc. hash result, hash store, i-j-pntr, state selection, seq. nbr. store, disable mask upd.; remaining bits are reserved.]
The applicable fields are listed in Table C-38.
Table C-38. DTLS Context Control Word 1
Field | Value | Description
Seq. Num. Store | 1 | Disable estimation of sequence number.
Hash Store | 1 | Store result digest into internal context register.
Pad Type | 101 | TLS Pad Type.
Pad Type | 110 | SSL Pad Type.
Pad Type | 000 | No padding for Null-crypto case.
Crypto Store | 1 | Store IV/ARC IJ pointer back in context.
IV Format | 00 | Use full IV mode for IV processing.
Digest Cnt. | 0 | Digest counter is not used.
IV3…IV0 | 0000 | No IV (for Null-Crypto Mode).
IV3…IV0 | 0011 | 8-byte IV (DES, 3DES).
IV3…IV0 | 1111 | 16-byte IV (AES).
Feedback Mode | * | See Appendix D: on page 371.
Crypto Mode | * | See Appendix D: on page 371.
C.4.2.5 Outbound Processing
Introduction
This chapter explains how DTLS outbound processing can be done in the packet processor.
The DTLS outbound processing is shown in Figure C-13.
[Figure C-13 shows three stages of the outbound flow.
Outbound input: Type, Version, Epoch, Seq Num, Len (payload), Payload.
Integrity (Hash): Epoch, Seq Num, Type, Version, Len (payload), Payload, producing the MAC.
Confidentiality (Encrypt): IV, Payload, MAC, Padding, Len (pad), with Len (frag).
Outbound output: Type, Version, Epoch, Seq Num, Len (frag), Fragment.
The arrows are labelled “Move before hash”, “Insert hashing result”, “Append after encrypt”, and “Replace Len before transmit”.]
Figure C-13: DTLS Outbound Processing
The DTLS outbound is a hash-encrypt type of processing, where the calculation of hash value is
done and then part of the packet and calculated hash value are encrypted.
Input Data (DTLS Outbound)
The following data must be provided by the host to the packet processor in order to perform outbound DTLS processing:
•
Payload data
•
Header data
•
Type and version fields
•
Epoch/Sequence number
•
Length of the payload data
•
Length of the result fragment
•
Padding length
•
Cipher and hash keys
•
Packet IV for block cipher
The Type and Version fields are taken from the SPI field of the context record. The layout of
SPI field is shown below.
Context – SPI
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
Type
Version Major
Version Minor
0000 0000
The 2-byte epoch and 6-byte (48-bit) sequence number are taken from two sequence number fields
of the context record. The epoch value is placed in Sequence number 1 field as shown below.
Context – Sequence number 1
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
Epoch
Sequence number[47:32]
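The two context words above can be assembled on the host as shown in this minimal sketch. The function names are hypothetical. The bit positions are assumptions read from the left-to-right order of the layouts: Type in bits 31:24, Version Major in 23:16, Version Minor in 15:8, and zeros in 7:0 for the SPI word; Epoch in bits 31:16 and Sequence number[47:32] in bits 15:0 for the Sequence number 1 word. Placing the low 32 bits of the sequence number in a separate word is likewise an assumption.

```python
def pack_spi(content_type, version_major, version_minor):
    """Pack Type, Version Major, Version Minor into the 32-bit SPI word."""
    for value in (content_type, version_major, version_minor):
        if not 0 <= value <= 0xFF:
            raise ValueError("each field must fit in one byte")
    # Assumed layout: Type[31:24], Major[23:16], Minor[15:8], zeros[7:0].
    return (content_type << 24) | (version_major << 16) | (version_minor << 8)


def pack_sequence_words(epoch, seq_num):
    """Return (sequence number 1 word, low 32 bits of the sequence number)."""
    if not 0 <= epoch <= 0xFFFF:
        raise ValueError("epoch must fit in 16 bits")
    if not 0 <= seq_num < (1 << 48):
        raise ValueError("sequence number must fit in 48 bits")
    # Assumed layout: Epoch[31:16], Sequence number[47:32] in bits [15:0].
    word1 = (epoch << 16) | (seq_num >> 32)
    word0 = seq_num & 0xFFFFFFFF
    return word1, word0
```

For application data (type 23) with version bytes 254 and 255, `pack_spi(23, 254, 255)` gives the word 0x17FEFF00 under these assumptions.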
In the case of context reuse, sequence number is auto incremented. The overflow of the 48-bit
sequence number results in sequence number overflow error (E10) (see Table D-8, “Error Codes,”
on page 430).
The three length fields must be pre-calculated and provided via the token fields:
•
Length of the payload data – used for MAC calculation.
•
Length of the padding data – used for insertion of padding data, so that the encrypted data have
length multiple of the cipher block size. In case of Null-crypto, padding is not necessary.
•
Length of the result fragment – the value is transmitted to the output as a fragment length and
is calculated by formula:
fragment_len = len(IV) + len(payload) + len(MAC) + len(total_pad_sequence).
When a block cipher is used, the IV is taken from the IV field of the context record. This IV is
inserted as part of the fragment.
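The three pre-calculated lengths can be derived as in the sketch below. The function names are hypothetical. It assumes that the total pad sequence consists of the padding bytes plus the one-byte pad length field, which matches the "- 1" term in the inbound length equation of Figure C-14, and that the encrypted region (payload, MAC, pad sequence) must be a multiple of the cipher block size. For Null-crypto, pass `block_size=None`.

```python
def total_pad_sequence_len(payload_len, mac_len, block_size):
    """Length of the padding bytes plus the 1-byte pad length field."""
    if block_size is None:
        return 0  # Null-crypto: padding is not necessary
    # Smallest pad count so that payload + MAC + pad + 1 fills whole blocks.
    pad_len = -(payload_len + mac_len + 1) % block_size
    return pad_len + 1


def fragment_len(iv_len, payload_len, mac_len, pad_sequence_len):
    """fragment_len = len(IV) + len(payload) + len(MAC) + len(total_pad_sequence)."""
    return iv_len + payload_len + mac_len + pad_sequence_len
```

With AES-CBC (16-byte IV and block size) and HMAC-SHA1 (20-byte MAC), a 100-byte payload needs an 8-byte pad sequence, giving a 144-byte fragment whose encrypted part is 128 bytes.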
Data Flow
The DTLS outbound processing is executed according to Table C-39.
Table C-39. DTLS Outbound Processing
Destination columns: Remove, Hash, Cipher, Output, Context.
N | Instruction | Source of Data | Remove | Hash | Cipher | Output | Context
1 | DIRECTION (Bypass data) | input | - | - | - | Bypass | -
2 | INSERT (epoch and seq. number) | context | - | epoch and seq. num | - | - | -
3 | INSERT (lower seq. number) | context | - | lower seq. num | - | - | -
4 | INSERT (type and version) | context | - | type and version | - | type and version | -
5 | INSERT (epoch and seq. number) | context | - | - | - | epoch and seq. num | -
6 | INSERT (lower seq. number) | context | - | - | - | lower seq. num | -
7 | INSERT (payload length) | token | - | payload length | - | - | -
8 | INSERT (fragment length) | token | - | - | - | fragment length | -
9 | INSERT (IV) (a) | context | - | - | - | IV | -
10 | DIRECTION (payload) | input | - | payload | payload | payload | -
11 | INSERT (MAC result) | context | - | - | MAC | MAC | -
12 | INSERT (padding) | instruction | - | - | padding | padding | -
13 | CONTEXT_ACCESS (update sequence number) | context | - | - | - | - | Sequence number (1)
a. In case of block cipher.
Outbound processing functions as follows:
1.
Bypass data are directly copied to the output stream.
2.
Epoch and sequence number are inserted in the output stream.
3.
Type and version is provided at the same time to the hash engine and to the output stream.
4.
Epoch and sequence number are provided to the hash engine.
5.
Payload length field, provided by the host, is inserted in the hash engine.
6.
Fragment length, provided by the host, is inserted in the output stream.
7.
In case of block cipher, IV is inserted to the output stream.
8.
Payload data is passed to hash engine and to the crypto engine. The encrypted payload is
passed to the output.
9.
The calculated MAC value is provided to the crypto engine. The encrypted MAC is passed to
the output stream.
10. Optional padding data are provided to the crypto engine. The encrypted padding is passed to
the output stream.
11. The incremented sequence number is written back to the context record in memory.
C.4.2.6 Inbound Processing
Introduction
The DTLS inbound processing is shown in Figure C-14.
[Figure C-14 shows the stages of the inbound flow.
Inbound input: Type, Version, Epoch, Seq Num, Len (frag), Fragment.
Decrypt phase 1: decrypt the last two blocks and extract the last byte, Len (pad).
Len (payload) = Len (frag) - Len (IV) - Len (MAC) - Len (pad) - 1
Hash: Epoch, Seq Num, Type, Version, Len (payload), Payload, producing the MAC.
Decrypt phase 2: the fragment after decryption consists of IV, Payload, RX MAC, Padding, Len (pad).
Inbound output: Payload.
The arrows are labelled “Move before hash” and “Append after decrypt”.]
Figure C-14: DTLS Inbound Processing
The Decrypt phase 1 is used in the case of block cipher and is executed by the host.
Input Data (DTLS Inbound)
The following data must be provided by the host to the packet processor in order to perform
inbound processing:
•
Incoming packet
•
Header checking information
•
Type, Version fields
•
Epoch value
•
48-bit maximum received sequence number and current mask value
•
Pre-calculated length of the payload data
•
Cipher and hash keys
•
Packet IV for block cipher
The Type and Version fields for checking are taken from the SPI field of the context record. The
layout of SPI field is the same as described in “Input Data (Outbound)” on page 299. The packet
processor can check if type and version match the values from inbound packet. When a mismatch
is detected, SPI check error is generated.
The Epoch value and maximum received 48-bit sequence number are taken from two sequence
number fields of the context record as described in “Input Data (Outbound)” on page 299.
Two length fields must be pre-calculated by host and provided to the packet processor via the
token fields:
•
The pad length – is obtained after decrypting the two last blocks of data, when block cipher
algorithm is used. In the case of Null-crypto, padding is not necessary and decryption of last
two blocks is not required.
•
Length of the payload data – is used for MAC calculation. This length is calculated using the
following formula:
payload_len = len(fragment) – len(IV) – len(MAC) – len(total_pad_sequence).
The IV for the block cipher algorithm is taken from IV field of context record. This IV is retrieved
from the inbound packet.
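The payload length calculation can be sketched as follows, assuming the host has already decrypted the last two blocks and read the pad length from the final byte. The function name `inbound_payload_len` is hypothetical, and the total pad sequence is taken to be the padding bytes plus the one-byte pad length field, as in the equation of Figure C-14.

```python
def inbound_payload_len(fragment_len, iv_len, mac_len, pad_len, block_cipher=True):
    """payload_len = len(fragment) - len(IV) - len(MAC) - len(total_pad_sequence)."""
    # For a block cipher the pad sequence is the padding plus its length byte;
    # for Null-crypto there is no padding at all.
    pad_sequence_len = pad_len + 1 if block_cipher else 0
    payload_len = fragment_len - iv_len - mac_len - pad_sequence_len
    if payload_len < 0:
        raise ValueError("length fields are inconsistent with the fragment")
    return payload_len
```

A negative result indicates a malformed or tampered record, so a host would normally drop such a packet rather than pass the lengths to the packet processor.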
Data Flow
The DTLS inbound processing is executed according to Table C-40.
Table C-40. DTLS Inbound Processing
(The Remove, Hash, Cipher, Output, and Context columns are the data destinations.)
Instruction and Explanation | Source of Data | Remove | Hash | Cipher | Output | Context
DIRECTION (bypass data) | input | - | - | - | Bypass | -
RETRIEVE (store Type/Version in the context) | input | - | - | - | - | Type/Version (SPI offset)
RETRIEVE (epoch/seq. num) | input | - | epoch/seq. num | - | - | epoch/seq. num
RETRIEVE (lower seq. num) | input | - | lower seq. num | - | - | lower seq. num
INSERT (insert stored type/version) | context | - | type/version | - | - | -
INSERT (payload length) | token | - | payload length | - | - | -
REMOVE (fragment length) | input | fragment length | - | - | - | -
RETRIEVE (IV) a | input | - | - | - | - | IV (IV offset)
REMOVE_RESULT (store position for removal of ICV from the output buffer) | - | - | - | - | - | -
DIRECTION (payload) | input | - | Payload | Payload | Payload | -
DIRECTION b (MAC + padding) | input | - | - | MAC + padding | MAC + padding | -
Execution of REMOVE_RESULT instruction | output buffer | MAC | - | - | - | -
VERIFY_FIELDS (verify seq. num, SPI, padding, MAC) | - | - | - | - | - | -
CONTEXT_ACCESS (update sequence number and sequence mask in context record in memory) | context | - | - | - | - | Sequence number, mask
a. Only in case of block cipher.
b. For this instruction, the length of padding must be known at the beginning (see “Input Data (DTLS Inbound)” on page 310).
The inbound processing functions as follows:
1.
The bypass data are directly copied to the output stream.
2.
Type and version from the inbound packet are stored in SPI field of the context.
3.
Epoch and sequence number are retrieved from the input packet, stored in the context and
passed to hash engine.
4.
Stored type and version are provided to the hash engine.
5.
Payload length field is taken from token and passed to the hash engine.
6.
Fragment length is removed from the input stream.
7.
The IV is retrieved from the input stream and stored in the context record.
8.
Remember the position of the MAC field in the output stream.
9.
Decrypt payload and then provide payload to the hash engine and also to the output stream.
10. Decrypt MAC and padding and then insert them to the output stream.
11. Verify SPI, sequence number, padding data and also compare calculated MAC with retrieved
MAC from the input.
12. In case of successful processing, the result sequence number and mask are written back to the
context record in memory.
Inbound Checks
The DTLS inbound token can contain instructions for performing a number of inbound checks.
The available checks are:
•
Type and version check – inbound Type and version fields are checked against Type and
version fields stored in the SPI field of the context. In the case of mismatch, the
SPI check failure error is generated.
•
Sequence number check – the sequence number field consists of two parts: the epoch and the actual
sequence number. During the sequence number check, the received epoch number is checked
against the epoch number in the context. At the same time, the 48-bit sequence number is
checked for a replay condition. If either of these two checks fails, the sequence
number check error is generated.
•
Pad verification – for block ciphers, sequence of padding bytes after decryption can be detected
and removed. Wrong padding sequence causes pad verification failure.
•
MAC check – the calculated MAC value during inbound processing is compared with the MAC
value received from the input stream. In case of a mismatch, the authentication failure
error is generated.
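The epoch and replay check described above can be sketched in software. This is only an illustration of a conventional sliding-window check, not the packet processor's implementation: the 64-bit mask width, the bit ordering (bit 0 represents the highest sequence number seen), and the function name `check_replay` are assumptions that do not come from this guide.

```python
SEQ_BITS = 48   # DTLS sequence number width, as stated in the text
WINDOW = 64     # assumed mask width; the guide does not state it here


def check_replay(epoch, seq, ctx_epoch, max_seq, mask):
    """Return (accepted, new_max_seq, new_mask) for one received record.

    Bit 0 of mask stands for max_seq, bit n for max_seq - n (assumed layout).
    """
    if epoch != ctx_epoch or not 0 <= seq < (1 << SEQ_BITS):
        return False, max_seq, mask          # epoch mismatch or bad value
    if seq > max_seq:
        shift = seq - max_seq
        # Slide the window forward and mark the new highest number as seen.
        mask = ((mask << shift) | 1) & ((1 << WINDOW) - 1) if shift < WINDOW else 1
        return True, seq, mask
    offset = max_seq - seq
    if offset >= WINDOW or (mask >> offset) & 1:
        return False, max_seq, mask          # too old, or already received
    return True, max_seq, mask | (1 << offset)
```

The updated `max_seq` and `mask` correspond to the values that the final CONTEXT_ACCESS step writes back to the context record after a successful check.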
C.4.3
SSL/TLS Protocol
C.4.3.1 Introduction
The packet processor supports SSL3.0/TLS1.0/TLS1.1/TLS1.2 protocols without length field processing. This is due to the fact that the packet processor is designed as a stream processor and
does not have the ability to ‘look to the end of the packet’ to decrypt the last two words.
Therefore, to allow single-pass inbound processing, in the case of block ciphers, the external host
must:
•
Decrypt last two blocks of the packet
•
Extract the padding information from the decrypted data
•
Calculate payload length field
For single-pass outbound processing, the host must pre-calculate the fragment length, based on the
cipher and hash algorithms used.
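The host-side step for block ciphers can be sketched as follows. This is one possible reading, assuming CBC chaining: the last ciphertext block is decrypted and XORed with the block before it, and the final byte of the result is the pad length. The function name, the `decrypt_block` callback (a raw single-block decrypt supplied by the caller), and the `context_iv` fallback for one-block fragments are hypothetical.

```python
def last_pad_length(fragment, block_size, decrypt_block, context_iv=b""):
    """Recover the pad length byte from an encrypted fragment (CBC assumed).

    decrypt_block: hypothetical callable that decrypts exactly one block
    with the session key and performs no chaining.
    """
    if len(fragment) < block_size or len(fragment) % block_size:
        raise ValueError("fragment is not a whole number of cipher blocks")
    last = fragment[-block_size:]
    if len(fragment) >= 2 * block_size:
        prev = fragment[-2 * block_size:-block_size]   # previous ciphertext block
    else:
        prev = context_iv                              # single block: chain from the IV
    if len(prev) != block_size:
        raise ValueError("chaining value must be one cipher block long")
    plain = bytes(d ^ p for d, p in zip(decrypt_block(last), prev))
    return plain[-1]                                   # last byte is the pad length
```

The returned value is what the host would place in the token so that the payload length can be calculated before the packet is streamed through the packet processor.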
C.4.3.2 Supported Features
Support of SSL/TLS processing is implemented according to Table C-41.
Table C-41. Functionality for Processing TLS/SSL Packets
Functionality | Inbound | Outbound
Header processing | Removal. Check type and version. | Insertion from context.
IV processing | From Fragment (TLS1.1/TLS1.2). From Context (SSL/TLS1.0). | From Context.
Sequence number | Overflow check. | Generation. Overflow check.
Fragment compression/decompression | Null. | Null.
MAC | Verification. | Insertion.
Crypto padding | Removal and verification. Pad length is checked by the host. | Insertion.
Cipher algorithms (inbound and outbound) | Null-crypto, ARC4 with key length from 40 to 128, DES, 3DES, AES-CBC (128, 192, 256-bit key).
Hash (inbound and outbound) | MD5, SHA1 (SSL-MAC for SSL, HMAC for TLS).
C.4.3.3 Packet Format
The combined packet format for SSL/TLS is shown in Figure C-15.
[Figure: record layout. Fields in order: Type (1 byte), Version (2 bytes), Length of fragment (2 bytes), Fragment. The Fragment consists of IV, Payload, MAC, Padding, and L(pad) (1 byte).]
Type: 20 - change_cipher_spec, 21 - alert, 22 - handshake, 23 - application_data.
Version: {3,0} – SSL 3.0, {3,1} – TLS 1.0, {3,2} – TLS 1.1.
Length: <= 2^14 for Plaintext fragment, <= 2^14 + 1024 for Compressed fragment, <= 2^14 + 2048 for Encrypted fragment.
IV: block size; for block ciphers only (labeled “TLS 1.1 only” in the figure).
MAC: md5 – 16 bytes, sha – 20 bytes.
Padding: for block ciphers only. SSL: 0 <= L(pad) < Block Size. TLS: 0 <= L(pad) <= 255.
L(pad): for block ciphers only. SSL: Value = L(pad). TLS: Value = L(pad).
Figure C-15: SSL/TLS Packet Format
The difference between SSL and TLS packet formats is that TLS1.1 and TLS1.2 packets contain an
explicit IV field for block cipher algorithms.
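As an illustration of the 5-byte header in Figure C-15, the sketch below parses Type, Version, and fragment Length from a record. The function name and the returned dictionary are hypothetical. The {3,3} entry for TLS 1.2 is not shown in the figure and comes from the TLS 1.2 specification.

```python
import struct

CONTENT_TYPES = {20: "change_cipher_spec", 21: "alert",
                 22: "handshake", 23: "application_data"}
VERSIONS = {(3, 0): "SSL 3.0", (3, 1): "TLS 1.0", (3, 2): "TLS 1.1",
            (3, 3): "TLS 1.2"}          # {3,3} added from the TLS 1.2 spec
MAX_ENCRYPTED_FRAGMENT = 2**14 + 2048   # limit for an encrypted fragment


def parse_record_header(record):
    """Split the Type / Version / Length header off an SSL/TLS record."""
    if len(record) < 5:
        raise ValueError("record shorter than the 5-byte header")
    ctype, major, minor, length = struct.unpack("!BBBH", record[:5])
    if ctype not in CONTENT_TYPES:
        raise ValueError("unknown content type %d" % ctype)
    if length > MAX_ENCRYPTED_FRAGMENT:
        raise ValueError("fragment length exceeds 2^14 + 2048")
    return {"type": CONTENT_TYPES[ctype],
            "version": VERSIONS.get((major, minor), "unknown"),
            "fragment_length": length}
```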
C.4.3.4 Context Control Words
SSL/TLS processing requires the context control words to be configured properly. The layout and
allowable settings of the control words are shown in the figure below.
SSL/TLS Context Control Word 0
[Figure: Context – Control Word 0, 32-bit word, bits 31..00. Fields shown: context length, packet-based options, ToP, key, crypto algorithm, digest type, hash algorithm, SEQ, MASK0, MASK1, SPI, and reserved bits.]
The applicable fields are listed in Table C-42.
Table C-42. SSL/TLS Context Control Word 0
Field | Value | Description
SEQ | 11 | Use 64-bit sequence number.
SPI | 1 | SPI value is used in processing.
Hash Algorithm | 000 | MD5 HMAC/SSL-MAC.
Hash Algorithm | 001 | SHA1 SSL-MAC.
Hash Algorithm | 010 | SHA1 HMAC (used for SSL-MAC).
Digest Type | 11 | HMAC type of hash algorithm is used.
Crypto Algorithm | * | Select applicable crypto algorithm (see Table D-10, “Control Word 0 Field Encoding,” on page 437).
Key | 1 | The Key is used in processing for cipher algorithms.
Key | 0 | The Key is not used for Null-Crypto Mode.
Context Length | * | See description in “context length” on page 437.
Packet Based Options | 0000 | Default value.
ToP (Type of Packet) | 0010 | Outbound hash operation (for Null-Crypto Mode).
ToP (Type of Packet) | 1110 | Outbound hash-then-encrypt operation (for all other cipher algorithms).
ToP (Type of Packet) | 0011 | Inbound hash operation (for Null-Crypto Mode).
ToP (Type of Packet) | 0111 | Inbound decrypt-then-hash operation (for all other cipher algorithms).
SSL/TLS Context Control Word 1
[Figure: Context – Control Word 1, 32-bit word, bits 31..00. Fields shown: crypto mode, feedback, IV0, IV1, IV2, IV3, digest cnt, IV format, crypto-store, pad type, enc. hash result, hash store, i-j-pntr, state selection, seq. nbr. store, disable mask upd., address mode, and reserved bits.]
The applicable fields are listed in Table C-43.
Table C-43. SSL/TLS Context Control Word 1
Field | Value | Description
Seq. Num. Store | 1 | Disable estimation of sequence number.
State Selection | 0 | ARC4 stateless mode.
State Selection | 1 | ARC4 stateful mode.
I-J Pointer | 0 | ARC4 I-J Pointer is not used.
I-J Pointer | 1 | ARC4 I-J Pointer is available.
Hash Store | 1 | Store result digest into internal context register.
Pad Type | 101 | TLS Pad Type.
Pad Type | 110 | SSL Pad Type.
Pad Type | 000 | No padding for ARC4 and Null-crypto.
Crypto Store | 1 | Crypto state is saved for the next packet.
IV Format | 00 | Use full IV mode for IV processing.
Digest Cnt. | 0 | Digest counter is not used.
IV3..IV0 | 0000 | No IV (for Null-Crypto Mode, ARC4).
IV3..IV0 | 0011 | 8-byte IV (DES, 3DES).
IV3..IV0 | 1111 | 16-byte IV (AES).
Feedback Mode | * | See Appendix D on page 371.
Crypto Mode | * | See Appendix D on page 371.
C.4.3.5 SSL MAC
The SSL protocol uses the SSL-MAC authentication algorithm. SSL-MAC is a two-stage authentication algorithm that uses either the SHA1 or the MD5 hash function.
For SSL-MAC, the following sequence of calculations must be done:
hash(MAC_KEY + pad_2 + hash(MAC_KEY + pad_1 + authenticated_message))
where:
• + – Denotes concatenation
• pad_1 – The character 0x36 repeated 48 times for MD5 or 40 times for SHA1
• pad_2 – The character 0x5c repeated 48 times for MD5 or 40 times for SHA1
• hash – Hashing algorithm derived from the cipher suite
Since pad_1 and pad_2 have different lengths for MD5 and SHA1, the packet processor handles SSL-MAC in two different ways:
•
For MD5, the first inner hash block (MAC_KEY + pad_1) must be pre-calculated by the host
and provided as inner digest in context record, in the same way as for HMAC processing.
The first outer hash block (MAC_KEY + pad_2) must be pre-calculated by the host and provided as outer digest in context record.
•
For SHA1, only the MAC_KEY is provided in the context record (inner digest field). The
rest of the calculations are done internally.
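The SSL-MAC formula above can be reproduced with a short sketch using Python's standard `hashlib`. The function names are hypothetical. The second function illustrates the MD5 pre-calculation idea: with a 16-byte MAC key, MAC_KEY + pad_1 is 64 bytes, exactly one MD5 block, so the hash state after that block can be computed once per key and reused. How the packet processor stores these digests internally is not shown here.

```python
import hashlib

PAD_LEN = {"md5": 48, "sha1": 40}   # pad_1 / pad_2 repeat counts from the text


def ssl_mac(hash_name, mac_key, authenticated_message):
    """hash(MAC_KEY + pad_2 + hash(MAC_KEY + pad_1 + authenticated_message))."""
    n = PAD_LEN[hash_name]
    inner = hashlib.new(hash_name,
                        mac_key + b"\x36" * n + authenticated_message).digest()
    return hashlib.new(hash_name, mac_key + b"\x5c" * n + inner).digest()


def ssl_mac_md5_precomputed(mac_key, authenticated_message):
    """Same MD5 result, starting from per-key pre-calculated hash states."""
    inner_state = hashlib.md5(mac_key + b"\x36" * 48)   # done once per key
    outer_state = hashlib.md5(mac_key + b"\x5c" * 48)   # done once per key
    inner = inner_state.copy()
    inner.update(authenticated_message)
    outer = outer_state.copy()
    outer.update(inner.digest())
    return outer.digest()
```

For SHA1 the key plus pad is 20 + 40 = 60 bytes, which does not fill a 64-byte hash block. That is consistent with the text's statement that SHA1 is handled differently from MD5.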
C.4.3.6 Outbound Processing
Introduction
This section explains how SSL/TLS outbound processing can be done in the packet processor.
The combined outbound processing for SSL/TLS is shown in Figure C-16.
[Figure: outbound processing stages.
Outbound input: Type, Version, Len(payload), Payload.
Hash stage (fields inserted before hash): Seq Num, Type, Version (skipped for SSL), Len(payload), Payload. The hashing result is inserted as the MAC.
Encrypt stage: IV (TLS 1.1, (3)DES and AES only), Payload, MAC, Padding, Len(pad). Padding and Len(pad) apply to block ciphers only.
Outbound output: Type, Version, Len(frag), Fragment. The payload length is replaced with the fragment length before transmit.]
Figure C-16: SSL/TLS Outbound Processing
The SSL/TLS outbound is a hash-encrypt type of processing, where the calculation of the hash
value is done and then part of the packet is encrypted.
Input Data (SSL/TLS Outbound)
The following data must be provided by the host to the packet processor in order to perform
outbound processing:
•
Payload data
•
Header data
•
Type and version fields
•
Length of the payload data
•
Length of the result fragment
•
64-bit sequence number
•
Cipher and hash keys
•
Packet IV and padding length for block cipher
The Type and Version fields are taken from the SPI field of the context record. The layout of the SPI
field is shown in the figure below.
[Figure: Context – SPI, 32-bit word, bits 31..00. Fields from bit 31 down to bit 00: Type, Version Major, Version Minor, 0000 0000.]
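A minimal sketch of building the SPI word follows. It assumes each of the four fields occupies one byte, with Type in bits 31..24, Version Major in bits 23..16, Version Minor in bits 15..8, and zeros in bits 7..0. That is an inference from the field order in the figure and the one-byte sizes of the record header fields, and the function names are hypothetical.

```python
def pack_spi(content_type, version_major, version_minor):
    """Build the 32-bit SPI context word: Type | Major | Minor | 0x00."""
    for value in (content_type, version_major, version_minor):
        if not 0 <= value <= 0xFF:
            raise ValueError("each SPI field must fit in one byte")
    return (content_type << 24) | (version_major << 16) | (version_minor << 8)


def unpack_spi(word):
    """Return (type, version_major, version_minor) from an SPI word."""
    return (word >> 24) & 0xFF, (word >> 16) & 0xFF, (word >> 8) & 0xFF
```

For example, application_data (23) with TLS 1.0 ({3,1}) packs to 0x17030100 under these assumptions.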
The 8-byte (64-bit) sequence number is taken from two sequence number fields of the context
record (non-incremented). The internal sequence number counter is incremented for use in next
packet. The overflow of the 64-bit sequence number results in a sequence number overflow error
(E10) (see Table D-8, “Error Codes,” on page 430).
The three length fields must be pre-calculated and provided via the token fields:
•
Length of the payload data – used for MAC calculation.
•
Length of the padding data – used for insertion of padding data, so that encrypted data will
have a length multiple of the cipher block size.
•
Length of the result fragment – the value is transmitted to the output as a fragment length and
is calculated as:
len(fragment) = len(IV, for TLS1.1 and TLS1.2) + len(payload) + len(MAC)
+ len(pad) + 1.
Note:
The expression len(pad) + 1 is the length of the total pad sequence.
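The host-side pre-calculation of the pad length and fragment length can be sketched as follows. The sketch picks the smallest pad that makes the encrypted data a multiple of the block size (TLS also allows longer padding), and it assumes that stream and null ciphers carry neither padding nor a pad length byte. The function name and parameters are hypothetical.

```python
MAC_LEN = {"md5": 16, "sha1": 20}   # MAC sizes from Figure C-15


def outbound_lengths(payload_len, hash_name, block_size=0, explicit_iv=False):
    """Return (pad_len, fragment_len) for one outbound record.

    block_size=0 selects a stream or null cipher (assumed: no padding).
    explicit_iv=True models TLS1.1/TLS1.2 with a block cipher.
    """
    mac_len = MAC_LEN[hash_name]
    if block_size == 0:
        return 0, payload_len + mac_len
    iv_len = block_size if explicit_iv else 0
    # payload + MAC + pad + pad length byte must fill whole cipher blocks
    pad_len = -(payload_len + mac_len + 1) % block_size
    return pad_len, iv_len + payload_len + mac_len + pad_len + 1
```

With a 100-byte payload, SHA1, and AES (16-byte blocks) with an explicit IV, this gives a 7-byte pad and a 144-byte fragment.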
The IV for the block cipher is taken from IV field of the context record. For the TLS1.1 and TLS1.2
protocols, this IV is inserted as part of the fragment.
Data Flow
The data flow and instructions for outbound SSL/TLS processing are shown in Table C-44.
Table C-44. Outbound SSL/TLS Processing Flow
(The Remove, Hash, Cipher, Output, and Context columns are the data destinations.)
Instruction | Source of Data | Remove | Hash | Cipher | Output | Context
DIRECTION (bypass data) | input | - | - | - | Bypass | -
INSERT (seq. number) | context | - | seq. number | - | - | -
INSERT a (type and version) | context (SPI field) | - | type and version | - | type and version | -
INSERT (payload length) | token | - | payload length | - | - | -
INSERT (fragment length) | token | - | - | - | fragment length | -
INSERT b (IV for block cipher) | context | - | - | - | IV | -
DIRECTION (payload) | input | - | payload | payload | payload | -
INSERT (MAC result) | context | - | - | MAC | MAC | -
INSERT (padding, only for block ciphers) | instruction | - | - | padding | padding | -
CONTEXT_ACCESS c (update crypto state) | context | - | - | - | - | crypto state
CONTEXT_ACCESS (update sequence number) | context | - | - | - | - | Sequence number
a. For SSL packets, the ‘Version’ field is not inserted into the hash stream, but is still transmitted. Therefore this instruction
should be split into two insert instructions: one to hash and transmit the Type field and the other to transmit the Version field.
b. The IV is inserted only for TLS1.1 and TLS1.2 in the case of a block cipher algorithm.
c. For block ciphers and SSL/TLS1.0 protocols, this is the result IV; for the ARC4 algorithm this is the I-J Pointer and the
ARC4 state.
Outbound processing functions as follows:
1. Bypass data are directly copied to the output stream.
2. The sequence number is inserted in the hash engine.
3. Type and version are provided at the same time to the hash engine and to the output (see
note a in Table C-44).
4. The payload length field, provided by the host, is passed to the hash engine.
5. The fragment length, provided by the host, is passed to the output stream.
6. For TLS1.1/TLS1.2 and a block cipher, the IV is inserted in the output stream.
7. Payload data are passed to the hash engine and to the crypto engine. The encrypted payload is
passed to the output stream.
8. The calculated MAC value is inserted in the crypto engine. The encrypted MAC is passed to
the output stream.
9. Optional padding data are inserted in the crypto engine. The encrypted padding is passed to
the output stream.
10. The result crypto state is updated, to be reused for the next packet.
11. The incremented sequence number is written back to the context record in memory.
C.4.3.7 Inbound Processing
Introduction
[Figure: inbound processing stages.
Inbound input: Type, Version, Len(frag), Fragment.
Decrypt phase 1 (block cipher only): decrypt the last two blocks and extract the last byte, Len(pad).
Decrypt phase 2: IV (TLS 1.1, block ciphers only), Payload, RX.MAC, Padding, Len(pad).
Length calculation: Len(payload) = Len(frag) - Len(IV) - Len(MAC) - Len(pad) - 1.
Hash stage (fields inserted before hash from context): Seq Num, Type, Version (skipped for SSL), Len(payload), Payload. The calculated MAC is appended after decrypt.
Inbound output: Payload.]
Figure C-17: SSL/TLS Inbound Processing
The SSL/TLS inbound processing is a decrypt-hash type of processing, where decryption is done
first and then part of the packet is hashed. Decryption phase 1 is done by the host and decryption
phase 2 is done by the packet processor.
Input Data (SSL/TLS Inbound)
The following data must be provided by the host to the packet processor in order to perform
inbound processing:
•
Incoming packet
•
Header checking information
•
Type, Version fields
•
64-bit current sequence number
•
Pre-calculated length of the payload data
•
Cipher and hash keys
•
Packet IV for block cipher
The Type and Version fields for checking are taken from the SPI field of the context record.
The layout of the SPI field is the same as described in “Input Data (Outbound)” on page 299. The
packet processor can check whether the type and version match the values from the inbound packet. In
case of a mismatch, an SPI check error is generated.
The current 8-byte (64-bit) sequence number is taken from two sequence number fields of the context record (non-incremented). The sequence number is incremented after successful processing
of the packet. The overflow of the internal 64-bit sequence number results in a
sequence number overflow error (see Table D-8 on page 430).
Two length fields must be pre-calculated by host and provided via the token fields:
•
The pad_len – which is obtained after decrypting the two last blocks of data for block cipher
algorithm. In the case of ARC4 and Null-crypto, padding is not necessary and decryption of
last two blocks is also not required.
•
Length of the payload data – used for MAC calculation. This length is calculated using the following formula:
len(payload) = len(fragment) - len(IV, for TLS1.1 and TLS1.2) - len(MAC) - len(total_pad_sequence).
The IV for the block cipher is taken from IV field of context record. For the TLS1.1 and TLS1.2 protocols, this IV is retrieved from the inbound packet.
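The inbound payload length calculation can be written as a small helper. The name and parameters are hypothetical. `pad_len` is the value the host obtained by decrypting the end of the fragment, and `None` stands for ARC4 or Null-crypto, where the sketch assumes no pad sequence is present.

```python
def inbound_payload_len(fragment_len, mac_len, pad_len=None, iv_len=0):
    """len(payload) = len(fragment) - len(IV) - len(MAC) - len(total_pad_sequence).

    iv_len is nonzero only for TLS1.1/TLS1.2 with a block cipher.
    The total pad sequence is the padding plus the pad length byte.
    """
    total_pad_sequence = 0 if pad_len is None else pad_len + 1
    payload_len = fragment_len - iv_len - mac_len - total_pad_sequence
    if payload_len < 0:
        raise ValueError("lengths are inconsistent with the fragment length")
    return payload_len
```

A negative result means the received lengths cannot describe a valid record, so the sketch rejects it before anything is passed on in the token.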
Data Flow
The data flow and instructions for inbound SSL/TLS processing are shown in Table C-45.
Table C-45. Inbound SSL/TLS Processing Flow
(The Remove, Hash, Cipher, Output, and Context columns are the data destinations.)
Instruction and Explanation | Source of Data | Remove | Hash | Cipher | Output | Context
DIRECTION (bypass data) | input | - | - | - | Bypass | -
RETRIEVE (store type/version in the context) | input | - | - | - | - | type/version (SPI offset)
INSERT (seq. number) | context | - | seq. number | - | - | -
INSERT a (insert stored type/version) | context | - | type/version | - | - | -
INSERT (payload length) | token | - | payload length | - | - | -
REMOVE (fragment length) | input | fragment length | - | - | - | -
RETRIEVE (IV) b | input | - | - | - | - | IV (IV offset)
REMOVE_RESULT (store position for removal of MAC from the output buffer) | - | - | - | - | - | -
DIRECTION (payload) | input | - | payload | payload | payload | -
DIRECTION c (MAC + padding) | input | - | - | MAC + padding | MAC + padding | -
Execution of REMOVE_RESULT instruction | output buffer | MAC | - | - | - | -
VERIFY_FIELDS (verify SPI, padding, MAC) | - | - | - | - | - | -
CONTEXT_ACCESS d (update crypto state) | context | - | - | - | - | crypto state
CONTEXT_ACCESS (update sequence number in context record in memory) | context | - | - | - | - | Sequence number
a. For SSL packets, the ‘Version’ field is not inserted in the hash stream, but is still received. Therefore this instruction should hash
only the Type field.
b. The IV is retrieved from the input packet only for TLS1.1 and TLS1.2 in the case of a block cipher.
c. For this instruction, the length of padding must be known up front (see “Input Data (SSL/TLS Outbound)” on page 317).
d. For block ciphers and SSL/TLS1.0 protocols, this is the result IV; for the ARC4 algorithm this is the I-J Pointer and the ARC4 state.
The inbound processing functions as follows:
1.
The bypass data are directly copied to the output stream.
2.
Type and version from the inbound packet are stored in SPI field.
3.
Sequence number is taken from the context and passed to hash engine.
4.
Stored type and version are provided to the hash engine.
5.
Payload length field is taken from token and passed to the hash engine.
6.
Fragment length is removed from the input stream.
7.
For TLS1.1 and TLS1.2, the IV is retrieved from the input stream and stored in the context
record.
8.
Remember the position of the MAC field in the output stream.
9.
Decrypt payload and then provide payload to the hash engine and also to the output stream.
10. Decrypt MAC and padding and then insert them to the output stream.
11. Verify SPI, sequence number, padding data and also compare calculated MAC with retrieved
MAC from the input.
12. Update result crypto state, which will be reused for the next packet.
13. In case of successful processing, the incremented sequence number is written back to the
memory, where sequence number field in context record is located.
Inbound Checks
The SSL/TLS token can contain instructions for performing a number of inbound checks. The
available checks are:
•
Type and version check – inbound Type and version fields are checked against Type and
version fields stored in the SPI field of the context. In the case of a mismatch, the
SPI check failure error is generated.
•
Sequence number check – incrementing the internal sequence number can overflow the 64-bit
counter. A counter overflow causes a sequence number check failure.
•
Pad verification – for block ciphers, the sequence of padding bytes after decryption can be
detected and removed. A wrong padding sequence causes a pad verification failure.
•
MAC check – the calculated MAC value during inbound processing is compared with the MAC
value received from the input stream. In the case of a mismatch, the
authentication failure error is generated.
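The pad verification check can be illustrated with a software sketch that follows the rules in Figure C-15. For TLS, every padding byte must equal the pad length. For SSL, the sketch checks only that the pad length is smaller than the block size. The function name and the `protocol` argument are hypothetical, and the sketch makes no attempt at constant-time comparison.

```python
def padding_is_valid(decrypted, block_size, protocol):
    """Check the pad sequence at the end of a decrypted fragment.

    decrypted: payload + MAC + padding + pad length byte (block cipher case).
    protocol:  "ssl" or "tls".
    """
    if not decrypted:
        return False
    pad_len = decrypted[-1]                  # L(pad) is the last byte
    if pad_len + 1 > len(decrypted):
        return False                         # pad longer than the data
    if protocol == "ssl":
        return pad_len < block_size          # SSL: 0 <= L(pad) < block size
    padding = decrypted[-1 - pad_len:-1]     # TLS: bytes before L(pad)
    return all(b == pad_len for b in padding)
```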
C.5 Public Key Accelerator (PKA)
C.5.1
PKA Firmware Architecture Overview
The main objective of the PKA firmware is to manage the farm engines and AES core. This means
that commands given via the PKI command interface are handed off to the farm engines and
when a farm engine is ready with the assigned command, the result is either copied back via the
PKI command interface or used for the follow-up command on the same farm engine. Summarized, the following PKA firmware functionality can be distinguished:
•
Copy the given PKI commands to the applicable command/result caches in buffer RAM for
processing,
•
Select a command from command caches to process,
•
Copy and optionally decrypt the necessary PKA vector(s),
•
Hand-off the PKA vector(s) and command to a free farm engine,
•
Copy the result from the ready farm engine to the applicable result cache and optionally the
result PKA vector(s),
•
Copy the result(s) to the PKI command/result interface.
Note: Some PKI commands are immediately processed without farm engine involvement.
Figure C-18 shows an overview of all tasks and triggers based on the required PKA functionality.
[Figure: task flow diagram. Tasks shown: CommandCopy (Ring0..3), CommandHandler (pre-farm), Farm0..9, FarmHandler (post-farm), ResultCopy (Ring0..3), Zeroize DMA Channel FIFO (vector copy), and Farm LNME reset and memory zeroize. Triggers shown: command available, DMA channel ready/error interrupt, PKCP interrupt, and FarmReady interrupt. Tasks are started by a fork or a start-task request and resumed when a farm becomes free, when a linked command is ready to process, when a result has been written to the ring, or when a result ring entry becomes free.]
Figure C-18: PKA Task Overview
The responsibilities of the tasks are defined as follows:
•
CommandCopy
These tasks (one for each command/result ring) copy the PKI commands from the CPU
memory to the applicable command/result cache and start the CommandHandler task if
needed.
•
CommandHandler
This task selects, based on the specified ring priority, a PKI command from the command
caches to process. For the selected PKI command the required PKA vectors are copied from
the CPU memory and optionally decrypted. Depending on the PKI command a farm engine is
allocated and started with a PKA command derived from the PKI command. When no farm
engine involvement is required, the result is copied to the applicable command/result cache
and the optional result PKA vectors are copied to the CPU memory.
•
Farm0..9
These tasks represent the farm engines. A farm engine executes the off-loaded PKA
command.
•
FarmHandler
This task is started when a farm engine is ready. The task selects the first farm engine that
reported ready, copies result and optionally the result PKA vectors. When the PKI command
is completed, the farm engine is released and the optional result PKA vectors are copied to the
CPU memory and the result is copied to the applicable command/result cache. When the PKI
command is not yet completed, the farm engine is loaded with the follow-up PKA command.
•
ResultCopy
These tasks (one for each command/result ring) copy the PKI results from the command/
result cache to the CPU memory. The CommandHandler or FarmHandler will start the
applicable task when necessary.
•
Zeroize DMA channel FIFO (vector copy)
This task zeroizes the DMA channel FIFO that was used during the PKA vector copy operation. The CommandHandler task will start this task when all PKA vectors are copied for the
PKI command.
•
Farm LNME reset and memory zeroize
This task initiates the LNME reset and zeroizes the farm engine memory. The FarmHandler
task will start this task when there is no PKA command (follow-up or the initial command of
the next PKI command) for the farm engine available.
The zeroize functionality implements the “FIPS (140-3)” on page 579, Security Level 3 functionality.
C.5.2
Command and Vector Copy and Zeroization
Figure C-19 and Figure C-20 show how the PKI commands/results and PKA vectors are copied between
the CPU memory and the RAM areas of the PKA, and which task is responsible for each copy operation.
The figures also mark the important zeroize action points. Figure C-19 covers PKI commands that do
not require a farm engine. Figure C-20 covers PKI commands that do need a farm engine, including
the optional follow-up PKA command situation.
The zeroize functionality zeroizes the DMA channel FIFO after the copy operations, and all used
memory after the PKI command is off-loaded to a farm engine or is completed. In the case of farm
engine involvement, the LNME and farm memory are zeroized when the farm engine is not
directly needed for a PKA command.
In general the buffer RAM contains the ring configuration, command/result caches and the public
PKA master controller scratchpad. The secure RAM cannot be accessed from outside (CPU) and
contains the Key Decrypt Key management, the private PKA master controller scratchpad and the
data area for the CommandHandler and FarmHandler tasks. This data area is split into a pre-data
area for the CommandHandler and a post-data area for the FarmHandler task. The Farm RAM is
the workspace of the farm engine and contains the PKA input and output vectors and scratchpad.
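[Figure: copy paths between CPU Memory, Buffer RAM, and Secure RAM. CommandCopy moves Command(s) from CPU Memory to Buffer RAM over DMA Ch0. CommandHandler moves Input Vector(s) from CPU Memory to Secure RAM and Output Vector(s) back to CPU Memory over DMA Ch0. ResultCopy moves the Result from Buffer RAM to CPU Memory over DMA Ch0. Zeroizing the DMA channel FIFO is a separate task.]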
Figure C-19: No Farm Engine Command and Data Copy
[Figure: copy paths between CPU Memory, Buffer RAM, Secure RAM, and Farm RAM. CommandCopy moves Command(s) from CPU Memory to Buffer RAM over DMA Ch0. CommandHandler moves Input Vector(s) and Registers through Secure RAM to Farm RAM over DMA Ch0/1 and Ch1, then zeroizes the pre-data area. On each Farm Ready event, Registers and Intermediate or Output Vector(s) are moved between Farm RAM and Secure RAM over DMA Ch2, and Output Vector(s) are returned to CPU Memory over DMA Ch0. ResultCopy moves the Result to CPU Memory over DMA Ch0. The post-data area and the DMA channel FIFO are zeroized afterwards. Resetting the LNME and zeroizing the farm memory when the farm is no longer used is a separate task.]
Figure C-20: Farm Engine-Related Command and Data Copy
C.5.3
PKI Command Interface
This section describes the PKA command interface, based on (up to four independent) command descriptor rings, with a separate result descriptor ring for each command ring. Only a small
part of this interface is built in hardware (the command and result counters with associated interrupt logic). Most of the functionality is defined by the firmware running on the PKA master
controller Sequencer.
Some parts of this document go into internal details of operation that are not ‘visible’ outside the
device – this is meant as a kind of ‘reality check’ to prevent defining things that cannot work
within the hardware framework of the PKA.
C.5.4
Main PKI Command Interface
The main PKI command interface uses descriptor/result rings held in Host memory space.
Descriptors do not contain any vector data – they contain pointers to vectors in Host address
space.
C.5.4.1 Descriptor Ring Management
Descriptors are 32 bytes in size. Four separate command descriptor rings holding up to 64K
(65536) descriptors each can be used, each accompanied by a result descriptor ring of the same
size. Command and result descriptor rings can be co-located or placed at different (non-overlapping) locations in Host space.
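The ring sizing above can be made concrete with a short sketch. It assumes descriptors are packed back to back starting at the ring base address. The ring size control words also carry an offset value whose effect is not modeled here, and the function names are hypothetical.

```python
DESCRIPTOR_SIZE = 32        # bytes per descriptor, from the text
MAX_DESCRIPTORS = 65536     # 64K descriptors per ring, from the text


def ring_bytes(ring_entries):
    """Host memory needed for one descriptor ring (packed layout assumed)."""
    if not 1 <= ring_entries <= MAX_DESCRIPTORS:
        raise ValueError("ring must hold between 1 and 65536 descriptors")
    return ring_entries * DESCRIPTOR_SIZE


def descriptor_address(ring_base, ring_entries, index):
    """Address of descriptor `index`, wrapping around at the ring end."""
    ring_bytes(ring_entries)                 # validates ring_entries
    return ring_base + (index % ring_entries) * DESCRIPTOR_SIZE
```

A full-size ring therefore occupies 65536 × 32 bytes = 2 MiB, and four command rings with separate result rings need eight such regions.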
If multiple rings are used, selection of the ring that supplies the next PKI command to execute is
normally done using rotating priority. It is also possible to place Ring 0 at a higher priority than
the remaining rings or to turn the rotating priority off (in which case Ring 0 gets the lowest priority and Ring 3 the highest priority).
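The three ring selection policies can be sketched as below. This only illustrates the described behavior and is not the firmware's algorithm. The mode names, the `last_served` bookkeeping, and the way Ring 0 is combined with rotation in the `ring0_first` mode are assumptions.

```python
def select_ring(pending, mode="rotate", last_served=3):
    """Pick the ring that supplies the next PKI command, or None.

    pending:     four booleans, True if that ring has a command waiting.
    mode:        "rotate" (rotating priority), "ring0_first" (Ring 0 above
                 the rest), or "fixed" (rotation off: Ring 3 highest,
                 Ring 0 lowest).
    last_served: ring served most recently; rotation starts after it.
    """
    rotation = [(last_served + i) % 4 for i in range(1, 5)]
    if mode == "rotate":
        order = rotation
    elif mode == "ring0_first":
        order = [0] + [ring for ring in rotation if ring != 0]
    elif mode == "fixed":
        order = [3, 2, 1, 0]
    else:
        raise ValueError("unknown mode: %r" % mode)
    for ring in order:
        if pending[ring]:
            return ring
    return None
```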
It is recommended to use separate rings if large differences in execution times for commands are
expected. This prevents a lot of results for short execution time commands being stalled by one
result for a long execution time command. The reason for this is that most of the internal buffer
RAM is used to buffer command descriptors from each command ring – no new commands can be
loaded when the oldest command in this buffer has not completed yet.
Read/write pointers for the rings should be kept locally by the Host and the PKA master controller (the latter will use some words of buffer RAM to hold them, providing progress indication and
a re-sync capability). No true ‘ownership’ bits are used in the descriptors. These are not necessary,
as the command counters can be used to figure out whether new commands can be written.
Result descriptors contain two ‘written zero’ bits that can be used (by a driver) for ownership indications but are mainly intended to prevent result interrupt race problems.
C.5.4.2 Descriptor Ring Control/Status Words
The Host must write the ring base addresses, size and initial read and write pointers at the start of
the buffer RAM (see Table C-46 below) before writing the PKA_RING_OPTIONS word that contains the ring option settings. The ring configuration settings (addresses, pointers, and option
settings) are processed when the first command is given; they are copied into secure RAM before
actual use and take effect from then onwards. Changing the ring configuration can only be done
by cycling the PKA through a reset.
Table C-46 provides the layout of the 8K Byte buffer RAM. The separate control words are
described in the next sections. The buffer RAM address space ranges from 0x00000 to 0x01FFF.
Table C-46. Buffer RAM Layout for PKI Interface
Byte Address (Within Buffer RAM) | Control / Status Word Name | Description
0x00000 | CMMD_RING_BASE_0 | Base address for command ring 0
0x00004 | Reserved, write zero | -
0x00008 | RSLT_RING_BASE_0 | Base address for result ring 0
0x0000C | Reserved, write zero | -
0x00010 | CMMD_RING_BASE_1 | Base address for command ring 1
0x00014 | Reserved, write zero | -
0x00018 | RSLT_RING_BASE_1 | Base address for result ring 1
0x0001C | Reserved, write zero | -
0x00020 | CMMD_RING_BASE_2 | Base address for command ring 2
0x00024 | Reserved, write zero | -
0x00028 | RSLT_RING_BASE_2 | Base address for result ring 2
0x0002C | Reserved, write zero | -
0x00030 | CMMD_RING_BASE_3 | Base address for command ring 3
0x00034 | Reserved, write zero | -
0x00038 | RSLT_RING_BASE_3 | Base address for result ring 3
0x0003C | Reserved, write zero | -
0x00040 | RING_SIZE_0 | Number and offset of descriptors in command and result rings 0
0x00044 | RING_SIZE_1 | Number and offset of descriptors in command and result rings 1
0x00048 | RING_SIZE_2 | Number and offset of descriptors in command and result rings 2
0x0004C | RING_SIZE_3 | Number and offset of descriptors in command and result rings 3
0x00050 | RING_RW_PTRS_0 | Read pointer of command ring 0, write pointer for result ring 0
0x00054 | RING_RW_PTRS_1 | Read pointer of command ring 1, write pointer for result ring 1
0x00058 | RING_RW_PTRS_2 | Read pointer of command ring 2, write pointer for result ring 2
0x0005C | RING_RW_PTRS_3 | Read pointer of command ring 3, write pointer for result ring 3
0x00060 – 0x0006F | Reserved, write zero | -
0x00070 | PKA_RING_OPTIONS | Main control word
0x00074 | MASTER_FW_VERSION | Master firmware version information
0x00078 – 0x01FFF | Reserved, do not modify | (includes command caches for rings 0 – 3)
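The per-ring control words in Table C-46 follow a regular stride: the base address words repeat every 0x10 bytes and the size and pointer words every 4 bytes. The following Python sketch computes the buffer RAM byte offsets for one ring. The function name `ring_control_offsets` and the constant names are illustrative assumptions, not part of any Tilera API; only the numeric offsets come from the table.

```python
# Byte offsets within the buffer RAM, taken from Table C-46.
CMMD_RING_BASE_0 = 0x00000
RSLT_RING_BASE_0 = 0x00008
RING_SIZE_0 = 0x00040
RING_RW_PTRS_0 = 0x00050
PKA_RING_OPTIONS = 0x00070
MASTER_FW_VERSION = 0x00074
NUM_RINGS = 4


def ring_control_offsets(ring):
    """Return the buffer RAM byte offsets of the control words of one ring.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    if not 0 <= ring < NUM_RINGS:
        raise ValueError("ring must be in the range 0..3")
    return {
        "CMMD_RING_BASE": CMMD_RING_BASE_0 + 0x10 * ring,
        "RSLT_RING_BASE": RSLT_RING_BASE_0 + 0x10 * ring,
        "RING_SIZE": RING_SIZE_0 + 4 * ring,
        "RING_RW_PTRS": RING_RW_PTRS_0 + 4 * ring,
    }
```

A driver would write these words for every enabled ring first and write PKA_RING_OPTIONS last, as required by the text above.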
Command Ring Base Address Control Words (CMMD_RING_BASE_0 … _3)
CMMD_RING_BASE_0 (Read/Write), 18-bit Address in Host Target Window: 0x00000
CMMD_RING_BASE_1 (Read/Write), 18-bit Address in Host Target Window: 0x00010
CMMD_RING_BASE_2 (Read/Write), 18-bit Address in Host Target Window: 0x00020
CMMD_RING_BASE_3 (Read/Write), 18-bit Address in Host Target Window: 0x00030
Bit layout: [31:0] Command ring base address
Table C-47. Command Ring Base Address Control Words Bit Descriptions
Bits | Name | Type | Function
[31:0] | Command ring base address | R/W | This is the base address of one command ring in Host address space. For performance reasons, aligning the base address to an 8-byte boundary is suggested, but this is not an absolute requirement. A command ring can be co-located with the accompanying result ring, in which case their base addresses must be identical; when not co-located, the two rings must not overlap.
Result Ring Base Address Control Words (RSLT_RING_BASE_0 … _3)
RSLT_RING_BASE_0 (Read/Write), 18-bit Address in Host Target Window: 0x00008
RSLT_RING_BASE_1 (Read/Write), 18-bit Address in Host Target Window: 0x00018
RSLT_RING_BASE_2 (Read/Write), 18-bit Address in Host Target Window: 0x00028
RSLT_RING_BASE_3 (Read/Write), 18-bit Address in Host Target Window: 0x00038
Bit layout: [31:0] Result ring base address
Table C-48. Result Ring Base Address Control Words Bit Descriptions
Bits | Name | Type | Function
[31:0] | Result ring base address | R/W | This is the base address of one result ring in Host address space. For performance reasons, aligning the base address to an 8-byte boundary is suggested, but this is not an absolute requirement. A result ring can be co-located with the accompanying command ring, in which case their base addresses must be identical; when not co-located, the two rings must not overlap.
RING_SIZE_x
RING_SIZE_0 (Read/Write), 18-bit Address in Host Target Window: 0x00040
RING_SIZE_1 (Read/Write), 18-bit Address in Host Target Window: 0x00044
RING_SIZE_2 (Read/Write), 18-bit Address in Host Target Window: 0x00048
RING_SIZE_3 (Read/Write), 18-bit Address in Host Target Window: 0x0004C
Bit layout: [31:16] Descriptor offset, [15:0] Ring size
Table C-49. Ring Size and Descriptor Offset Control Words Bit Descriptions
Bits | Name | Type | Function
[31:16] | Descriptor offset | R/W | This field specifies the offset in bytes between the starting locations of command descriptors, in the range 33 … 65535. Value 0 indicates that the descriptors are adjacent (with actual offset of 32 bytes); in that case, reading command descriptors is optimized to read more than one in a single DMA action. Values 1 … 32 are reserved and should not be used. The accompanying result ring will have the same (result) descriptor offset.
[15:0] | Ring size | R/W | This field specifies the size of a command ring in number of descriptors, minus 1. Minimum value is 0 (for 1 descriptor); maximum value is 65535 (for 64K descriptors). The accompanying result ring will have the same size.
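As an illustration of the encoding rules in Table C-49, the following Python sketch builds a RING_SIZE_x word from a descriptor count and a descriptor stride in bytes. The function name `encode_ring_size` and its parameters are assumptions made for this example; the field positions, the "minus 1" size encoding, and the reserved offset values 1 … 32 come from the table.

```python
DESCRIPTOR_BYTES = 32  # size of one command or result descriptor


def encode_ring_size(num_descriptors, stride_bytes=DESCRIPTOR_BYTES):
    """Build a RING_SIZE_x control word (hypothetical helper).

    num_descriptors: ring capacity, 1..65536 descriptors.
    stride_bytes:    distance between descriptor start addresses.
    """
    if not 1 <= num_descriptors <= 65536:
        raise ValueError("ring holds 1..65536 descriptors")
    if stride_bytes == DESCRIPTOR_BYTES:
        offset_field = 0  # 0 means adjacent descriptors (32-byte stride)
    elif 33 <= stride_bytes <= 65535:
        offset_field = stride_bytes
    else:
        raise ValueError("stride must be 32 or in the range 33..65535")
    # Descriptor offset in bits [31:16], ring size minus 1 in bits [15:0].
    return (offset_field << 16) | (num_descriptors - 1)
```

For example, a ring of 256 adjacent descriptors encodes as 0x000000FF, and a full 64K ring with a 64-byte stride encodes as 0x0040FFFF.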
RING_RW_PTRS_x
RING_RW_PTRS_0 (Read/Write), 18-bit Address in Host Target Window: 0x00050
RING_RW_PTRS_1 (Read/Write), 18-bit Address in Host Target Window: 0x00054
RING_RW_PTRS_2 (Read/Write), 18-bit Address in Host Target Window: 0x00058
RING_RW_PTRS_3 (Read/Write), 18-bit Address in Host Target Window: 0x0005C
Bit layout: [31:16] Result ring write pointer, [15:0] Command ring read pointer
Table C-50. Ring Read/Write Pointers Bit Descriptions
Bits | Name | Type | Function
[31:16] | Result ring write pointer | R/W | This field indicates the entry number in the result ring that will be written next by the PKA. It is reset to zero after starting up and is updated after every result descriptor write DMA operation. Pointers wrap around; the maximum value of this field equals the value of the ‘Ring size’ field of the corresponding RING_SIZE_x control word.
[15:0] | Command ring read pointer | R/W | This field indicates the entry number in the command ring that will be read next by the PKA. It is reset to zero after starting up and is updated after every command descriptor read DMA operation. Pointers wrap around; the maximum value of this field equals the value of the ‘Ring size’ field of the corresponding RING_SIZE_x control word.
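The pointer fields in Table C-50 can be decoded and advanced as in the following Python sketch. Both function names are illustrative assumptions. The wrap rule follows the table: a pointer that has reached the value of the ‘Ring size’ field (the descriptor count minus 1) returns to zero on the next step.

```python
def decode_ring_rw_ptrs(word):
    """Split a RING_RW_PTRS_x word into its two entry numbers.

    Returns (result ring write pointer, command ring read pointer).
    """
    return (word >> 16) & 0xFFFF, word & 0xFFFF


def next_pointer(pointer, ring_size_field):
    """Advance an entry number by one descriptor, wrapping to zero after
    the maximum value, which equals the 'Ring size' field."""
    return 0 if pointer >= ring_size_field else pointer + 1
```

A Host driver that keeps its own local copies of the pointers, as recommended earlier in this section, could apply the same wrap rule to them.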
PKA_RING_OPTIONS
PKA_RING_OPTIONS (Read/Write), 18-bit Address in Host Target Window: 0x00070
Bit layout: [31:24] ‘Signature’ byte, [23:9] Reserved, [8] Zero KDKs, [7] Ring 3 in-order, [6] Ring 2 in-order, [5] Ring 1 in-order, [4] Ring 0 in-order, [3:2] Ring enable control, [1:0] Ring prio control
Table C-51. PKA Ring Options Control Word Bit Descriptions
Bits | Name | Type | Function
[31:24] | ‘Signature’ byte | R/W | This byte must contain 0x46. It is needed because these options are transferred through RAM, which does not have a defined reset value. The PKA master controller keeps reading this word at start-up until the ‘Signature’ byte contains 0x46 and the ‘Reserved’ field contains zero.
[23:9] | Reserved | - | Bits MUST be written with a 0 and ignored on a read.
[8] | Zero KDKs | R/W | If this bit is ‘1’, the PKI Key Decryption Keys (KDK) storage areas and associated control words will be zeroed by internal FW during the boot-up procedure. This will indicate all KDKs as being invalid. If this bit is ‘0’, it is assumed that the KDK storage and control words have already been set up in secure RAM and they will be left intact during boot-up. Note that this bit is (functionally) forced to ‘1’ during a High Assurance mode boot-up as the KDK area is initially used to hold ‘farm’ engine firmware in that case.
[7:4] | Ring X in-order | R/W | These bits indicate whether a result ring delivers results strictly in-order (‘1’) or whether result descriptors are written to the result ring as soon as they become available, that is, out-of-order (‘0’). In the latter case, it is important that a driver tags each command descriptor with a number so that it can determine the command to which a result belongs.
[3:2] | Ring enable control | R/W | This field specifies how many rings will be used: ‘00’ = ring 0 only, ‘01’ = rings 0 and 1, ‘10’ = rings 0, 1 and 2, ‘11’ = all four rings.
[1:0] | Ring prio control | R/W | Ring priority control. This field specifies the ring priorities: ‘00’ = full rotating priority, ‘01’ = fixed priority (ring 0 lowest), ‘10’ = ring 0 has the highest priority (the remaining rings have rotating priority), ‘11’ = reserved, do not use.
To change ring options, the complete PKA must be cycled through reset.
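The following Python sketch assembles a PKA_RING_OPTIONS word from the fields in Table C-51. It is a minimal illustration under stated assumptions: the function and constant names are hypothetical, and the mapping of ring 0 … 3 to in-order bits [4] … [7] is inferred from the register diagram.

```python
SIGNATURE = 0x46  # required value of bits [31:24]

# Values of the 'Ring prio control' field, bits [1:0].
PRIO_ROTATING = 0b00        # full rotating priority
PRIO_FIXED = 0b01           # fixed priority, ring 0 lowest
PRIO_RING0_HIGHEST = 0b10   # ring 0 highest, others rotating


def encode_ring_options(num_rings, prio=PRIO_ROTATING,
                        in_order_mask=0, zero_kdks=False):
    """Build a PKA_RING_OPTIONS word (hypothetical helper).

    in_order_mask: bit n set selects in-order delivery for ring n
                   (assumed mapping: ring 0 is word bit [4]).
    """
    if not 1 <= num_rings <= 4:
        raise ValueError("between 1 and 4 rings can be enabled")
    if prio not in (PRIO_ROTATING, PRIO_FIXED, PRIO_RING0_HIGHEST):
        raise ValueError("priority value '11' is reserved")
    if not 0 <= in_order_mask <= 0xF:
        raise ValueError("in_order_mask is a 4-bit value")
    return ((SIGNATURE << 24)        # bits [31:24], reserved [23:9] stay 0
            | (int(zero_kdks) << 8)  # bit [8]
            | (in_order_mask << 4)   # bits [7:4]
            | ((num_rings - 1) << 2) # bits [3:2]
            | prio)                  # bits [1:0]
```

Because the master controller polls this word for the signature at start-up, a driver would write it only after all other ring control words are in place.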
MASTER_FW_VERSION
MASTER_FW_VERSION (Read/Write), 18-bit Address in Host Target Window: 0x00074
Bit layout: [31:24] Reserved, [23:16] Master FW major version number, [15:8] Master FW minor version number, [7:0] Master FW patch level
Table C-52. Master Firmware Version Information
Bits | Name | Type | Function
[31:24] | Reserved | R/W | Bits MUST be written with a 0 and ignored on a read.
[23:16] | Master FW major version number | R/W | Indicates the major version number of this master firmware release. The first full release will have version 1.0, so the value of this field will be 0x01.
[15:8] | Master FW minor version number | R/W | Indicates the minor version number of this master firmware release. The first full release will have version 1.0, so the value of this field will be 0x00.
[7:0] | Master FW patch level | R/W | Indicates the ‘patch level’ of this master firmware release; it will be 0 to start with.
Note: Although indicated as ‘Read/Write’, this memory location is meant to be handled as read-only.
Note: The first two instructions (words) of the main master firmware image also contain this
information. The first word holds the patch level in bits [7:0] and the minor FW version
number in bits [15:8] and the second word holds the major FW version number in bits [7:0].
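A small Python sketch of decoding the MASTER_FW_VERSION word according to Table C-52 follows. The function name is an assumption for illustration; the reserved bits [31:24] are ignored on read, as the table requires.

```python
def decode_master_fw_version(word):
    """Return (major, minor, patch) from a MASTER_FW_VERSION word.

    Bits [31:24] are reserved and are ignored here.
    """
    major = (word >> 16) & 0xFF
    minor = (word >> 8) & 0xFF
    patch = word & 0xFF
    return major, minor, patch
```

With the first full release described above (version 1.0, patch level 0), the word 0x00010000 decodes to (1, 0, 0).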
C.5.5 PKI Command and Result Descriptors
C.5.5.1 Command Descriptor Contents
Command descriptors are 32 bytes (8 words of 32 bits) long. The Host indicates the presence of
new command descriptors in a ring by incrementing the command counter associated with that
ring (after the descriptor contents have been written). It is possible to ‘link’ command descriptors
so that one command can only be executed when the previous linked command has been executed. These linked descriptors must be transferred into the ring as a whole (that is, the command
counter must be incremented by the number of linked commands).
PKI Command Descriptor
The generic layout of a PKI command descriptor is as follows:
PKI Command Descriptor (Read/Write)
Byte Offset | Contents
31:28 | Linked (bit [31]), Driver status (bits [30:29]), Shift count / odd powers (bits [28:24]), KDK nr. (bits [23:22]), Encrypted vectors bit mask (bits [21:16]), 00000000 (bits [15:8]), Command (bits [7:0])
27:24 | 00000 (bits [31:27]), Length ‘B’ (bits [26:18]), 0000000 (bits [17:11]), Length ‘A’ (bits [10:2]), 00 (bits [1:0])
23:20 | Pointer ‘E’ (bits [31:0])
19:16 | ‘Tag’ word for driver use (bits [31:0])
15:12 | Pointer ‘D’ (bits [31:0])
11:8 | Pointer ‘C’ (bits [31:0])
7:4 | Pointer ‘B’ (bits [31:0])
3:0 | Pointer ‘A’ (bits [31:0])
Table C-53. PKI Command Descriptor
Field | Description
Pointer ‘A’ … ‘E’ | These words provide up to 5 parameter/result pointers in Host space. The length of the parameters and results is a multiple of 4 bytes, but the start addresses can be at any byte boundary. Pointers are allowed to point to the same memory location.
‘Tag’ word for driver use | This complete word can be used by a Host driver to hold an identification value or pointer for its own administration. This word will be present (unchanged) in the result descriptor for this command.
Length ‘A’ / ‘B’ | These fields indicate the length of input vectors in 32-bit words. In general, they indicate the lengths of vectors ‘A’ and ‘B’, but their actual use depends on the operation performed.
Command | This field indicates which command to execute. Commands include (almost) all commands of a standard PKA Engine module as described in “PKI Command/Result Specifics (Firmware Dependent)” on page 337, with higher protocol level commands added. The standard PKA Engine commands do not use pointer ‘E’.
Encrypted vectors bit mask (only applicable for the PKAb and PKAd) | This field indicates (with a ‘1’) which of the five input parameter vectors contains encrypted data that must be decrypted before use. If a parameter vector contains sub-vectors, these are decrypted separately. Each command has specific rules as to which parameters can be provided in encrypted form. Illegal selection of encrypted vectors results in an error. Bits [19:16] of the last descriptor word control pointers ‘A’ … ‘D’, bit [21] controls pointer ‘E’ (bit [20] should be kept zero). NOTE: When encrypted vectors are used, be careful with the crypto mode selection (see “PKI Key Decrypt Key Control Words” on page 368). When using AES-ECB or AES-CBC, the sub-vectors must be multiples of 128 bits long, as these crypto modes only work on full-length AES blocks. When using AES-CFB, AES-OFB or AES-CTR, this restriction is not applicable, as the last block processed for these modes does not need to be a full-length AES block.
KDK nr | This field indicates which of the (up to) four Key Decryption Keys must be used to decrypt the input parameter vectors specified by the ‘Encrypted vector bit mask’. The value of this field is directly used as KDK number (in the range 0 … 3). Using an invalid KDK will result in an error.
Shift count / odd powers | This field is used to convey the number of bits to shift (in the range 0 … 31) for shift left/right basic operations. It is also used to convey the number of odd powers to use for modular exponentiation operations (with and without CRT). Allowed values for odd powers are in the range 1 … 16; selecting a value that is too high results in an error.
Driver status | These two bits are reserved for use by the Host driver. In a result descriptor, they are forced to zero (as ‘Written zero’ bits). When overlapping the command and result rings, they can be used to indicate the state of a descriptor block (‘empty’, ‘command’ or ‘result’; the last would get fixed code zero). When not used, these bits can be kept zero. They do not influence the PKI command handling.
Linked | This bit indicates the linked state of the descriptor block as follows: 0 = Normal command descriptor. 1 = Linked command descriptor. The next command in this ring cannot be executed before this one has finished execution.
Note: For linked descriptors, the PKA master controller will scan forward in the ring until it finds
the location of the first command descriptor following the linked descriptors (the last one of
the linked descriptors is a normal command descriptor). Handling the linked descriptors is
done in the same order as they are placed in the ring. Normal arbitration between
commands placed in rings is resumed with the command following the linked chain (if
any). In essence, a chain of linked descriptors is handled as a single command.
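To make the descriptor layout concrete, the following Python sketch packs one 32-byte command descriptor from its fields. It is an illustration only: the function name and parameter names are hypothetical, and little-endian word storage in Host memory is an assumption that this section does not state. The field positions follow the command descriptor layout above.

```python
import struct


def pack_command_descriptor(command, ptr_a=0, ptr_b=0, ptr_c=0, ptr_d=0,
                            ptr_e=0, tag=0, len_a=0, len_b=0,
                            shift_or_odd_powers=0, kdk_nr=0,
                            encrypted_mask=0, driver_status=0, linked=False):
    """Pack one PKI command descriptor into 32 bytes (hypothetical helper).

    Assumption: 32-bit words are stored little-endian in Host memory.
    """
    limits = [
        (command, 0xFF), (len_a, 0x1FF), (len_b, 0x1FF),
        (shift_or_odd_powers, 0x1F), (kdk_nr, 0x3),
        (encrypted_mask, 0x3F), (driver_status, 0x3),
    ]
    for value, limit in limits:
        if not 0 <= value <= limit:
            raise ValueError("descriptor field out of range")
    if encrypted_mask & 0b010000:
        raise ValueError("bit [20] of the last word must be kept zero")
    # Word at byte offset 27:24: Length 'B' in [26:18], Length 'A' in [10:2].
    length_word = (len_b << 18) | (len_a << 2)
    # Word at byte offset 31:28: control fields and command code.
    control_word = ((int(linked) << 31) | (driver_status << 29)
                    | (shift_or_odd_powers << 24) | (kdk_nr << 22)
                    | (encrypted_mask << 16) | command)
    return struct.pack("<8I", ptr_a, ptr_b, ptr_c, ptr_d, tag, ptr_e,
                       length_word, control_word)
```

For a chain of linked commands, every descriptor except the last would be packed with `linked=True`, and the command counter would then be incremented once by the length of the whole chain.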
C.5.5.2 Result Descriptor Contents
After finishing a command, the PKA master controller will convert the original command descriptor (as described in Command Descriptor Contents) into a result descriptor and write this
descriptor to the result ring. The result status at the end of command execution is returned mostly
in empty fields of the command descriptor, except for the sixth word which held pointer E (which
is completely modified).
PKI Result Descriptor
The generic layout of a PKI result descriptor is shown below, and Table C-54 describes its fields. Fields marked (copied) are copied
from the command descriptor.
PKI Result Descriptor (Read/Write)
Byte Offset | Contents
31:28 | Linked (bit [31]) (copied), Written zero (bits [30:29]), Shift count / odd powers (bits [28:24]) (copied), KDK nr. (bits [23:22]) (copied), Encrypted vectors bit mask (bits [21:16]) (copied), Result code (bits [15:8]), Command (bits [7:0]) (copied)
27:24 | CMP result (bits [31:29]), 00 (bits [28:27]), Length ‘B’ (bits [26:18]) (copied), 0000000 (bits [17:11]), Length ‘A’ (bits [10:2]) (copied), 00 (bits [1:0])
23:20 | Modulo = 0 (bit [31]), 00 (bits [30:29]), either Modulo MSW offset (bits [28:18]) or 000000 (bits [28:23]) followed by Main result MS bit offset (bits [22:18]), 00 (bits [17:16]), Result = 0 (bit [15]), 00 (bits [14:13]), Main result MSW offset (bits [12:2]), 00 (bits [1:0])
19:16 | ‘Tag’ word for driver use (bits [31:0]) (copied)
15:12 | Pointer ‘D’ (bits [31:0]) (copied)
11:8 | Pointer ‘C’ (bits [31:0]) (copied)
7:4 | Pointer ‘B’ (bits [31:0]) (copied)
3:0 | Pointer ‘A’ (bits [31:0]) (copied)
Table C-54. PKI Result Descriptor
Field | Description
Pointer ‘A’ … ‘D’ | These words provide up to 4 parameter/result pointers, unchanged.
‘Tag’ word for driver use | This word is unchanged from the command descriptor and can be used by a driver to match the result to a given command (for example, when out-of-order result delivery is selected).
Main result MSW offset / Result = 0 | These fields are copied almost directly from bits [10:0] and bit [15], respectively, of the FARM_PKA_MSW_X register of the ‘farm’ engine on which the command was executed. The only change is that the start offset for the main result vector of the command has been subtracted here (the Host need not concern itself with internal ‘farm’ data RAM management).
Modulo MSW offset / Modulo = 0 / Main result MS bit offset | These fields are copied almost directly from bits [10:0], bit [15], and bits [4:0], respectively, of the FARM_PKA_DIVMSW_X register of the ‘farm’ engine on which the command was executed. The only change is that the start offset for the Modulo result vector of the command has been subtracted here (the Host need not concern itself with internal ‘farm’ data RAM management). Note that only the basic Modulo and Divide operations return the Modulo MSW offset and zero indication; all others return the Main result MS bit offset (which will be zero if the Main result is zero). NOTE: For the Compare, ECDSA, DSA and AES known answer commands, the ‘Main result MSW offset/Result = 0’ and ‘Modulo MSW offset/Modulo = 0/Main result MS bit offset’ information must be ignored.
Length ‘A’ / ‘B’ | These fields were used to indicate the length of input vectors in 32-bit words, unchanged.
CMP result bits | These bits are only updated for a basic ‘Compare’ command and reflect the state of bits [2:0] of the FARM_PKA_COMPARE_X register of the ‘farm’ engine on which the command was run.
Command | This field indicates which command was executed, unchanged.
Result code | This field indicates the global result after the operation. Value 0x00 indicates that no errors were encountered. Other values reflect a warning or error; refer to Table C-55 for a complete overview of result codes.
Encrypted vectors bit mask | This field indicates which of the five input parameter vectors were provided in encrypted form, unchanged.
KDK nr | This field indicates which KDK had to be used to decrypt encrypted parameter vectors, unchanged.
Shift count / odd powers | This field was used to convey the number of bits to shift for shift left/right basic operations or the number of odd powers to use for modular exponentiation operations, unchanged.
Written zero | These bits are forced to zero when writing out a result descriptor. They can be used in conjunction with the ‘Driver status’ bits in the command descriptor to indicate the state of a descriptor block in a combined command/result ring. For a separate result ring, these bits can be used to determine that the writing of a result descriptor to Host space has been completed, provided that the driver sets these bits non-zero for empty result descriptor blocks.
Linked | This field indicates the ‘linked’ state of the descriptor block as follows (unchanged): 0 = Normal result descriptor. 1 = Linked result descriptor. This result descriptor does not contain the final result of a linked chain of command descriptors.
Note: Linked command descriptors deliver linked result descriptors one by one; it is up to the
Host to ignore or use the intermediate results. If a linked command returns an error (that is,
the Result code bit [15] is a 1), the error code is propagated to the last (non-linked) result
descriptor and linked commands following the error command are not executed.
Note: If a descriptor ring has in-order result delivery selected, the result counter is only
incremented for consecutive completed commands. This can mean that a result counter
increments very rapidly after a very time-consuming command that stalled other already
completed commands has been completed. With out-of-order result delivery, the result
counter is incremented immediately after a command has been completed and the result
descriptor has been written to Host space. This makes the driver more complex because it
must use the Tag word to keep track of commands and results but it can increase overall
system performance.
Note: The Written zero field in the last word of the result descriptor is used to prevent
interrupt race errors. These can happen when the
descriptor available interrupt reaches the Host processor before the descriptor
data has been written across the Host bus into the external memory. The interrupt handler
should poll the Written zero field to check that the data has indeed arrived before
relying on other information in the result descriptor.
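The check described in the note above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, words are assumed to be stored little-endian in Host memory, and the driver is assumed to have pre-filled the Written zero bits of every empty result slot with a non-zero value, as Table C-54 requires for this use.

```python
import struct


def parse_result_descriptor(raw):
    """Decode selected fields of a 32-byte PKI result descriptor.

    Assumption: 32-bit words are stored little-endian in Host memory.
    """
    words = struct.unpack("<8I", raw)
    status_word = words[5]   # byte offset 23:20
    length_word = words[6]   # byte offset 27:24
    control_word = words[7]  # byte offset 31:28
    return {
        "tag": words[4],
        "command": control_word & 0xFF,
        "result_code": (control_word >> 8) & 0xFF,
        "is_error": bool(control_word & (1 << 15)),
        "written_zero": (control_word >> 29) & 0x3,
        "linked": bool(control_word >> 31),
        "cmp_result": (length_word >> 29) & 0x7,
        "result_is_zero": bool(status_word & (1 << 15)),
        "main_result_msw_offset": (status_word >> 2) & 0x7FF,
    }


def result_has_arrived(raw):
    """True once the PKA has overwritten the slot: the Written zero bits
    read as zero. Only valid if the driver marked empty slots non-zero."""
    return parse_result_descriptor(raw)["written_zero"] == 0
```

An interrupt handler would call `result_has_arrived` on the expected slot and repeat the read until it returns true before trusting any other field.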
C.5.5.3 PKI Command/Result Specifics (Firmware Dependent)
This section gives a point-by-point description of the various operations available within the
PKA. All inputs and outputs must be considered as unsigned integers.

CAUTION: Unless otherwise indicated, parameter vectors must not be input in encrypted form. This is to
protect the stored Key Decryption Keys against attacks.
Table C-55 lists the result codes that are currently defined. The highest bit of the result code byte
(bit [15] of the last word in the result descriptor) indicates whether an error occurred (1) or not (0).
If an error occurred, result vectors are not written and only the result code in the result descriptor
conveys meaningful information.
Table C-55. PKI Result Code Values
Code | Description
0x00 | No error
0x81 | Modulus was even
0x02 | Exponent was 0, result returned as value 1 (for a modular exponentiation)
0x83 | Modulus was too short (less than 33 significant bits)
0x04 | Exponent was 1, result returns input value (for a modular exponentiation)
0x85 | Odd powers not in range 1 … 16
0x86 | Result point of ECC operation is ‘at infinity’ – not a real error!
0x87 | Unknown command
0x88 | Illegal encrypted parameter use
0x89 | Operand length error
0x8A | Farm memory too small for operation
0x8B | Modular inverse does not exist
0x8C | Operand value error
0x8D | (Intermediate) Result value error
0xC0 | Memory deadlock error
Others | Reserved
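The result codes in Table C-55 can be classified in software as in the following Python sketch. The dictionary and function names are illustrative assumptions. The error test uses the highest bit of the result code byte, as described above; note that code 0x86 has this bit set even though the table describes it as not a real error, so a driver may want to treat it separately.

```python
# Result codes from Table C-55.
RESULT_CODES = {
    0x00: "No error",
    0x81: "Modulus was even",
    0x02: "Exponent was 0, result returned as value 1",
    0x83: "Modulus was too short (less than 33 significant bits)",
    0x04: "Exponent was 1, result returns input value",
    0x85: "Odd powers not in range 1 ... 16",
    0x86: "Result point of ECC operation is 'at infinity'",
    0x87: "Unknown command",
    0x88: "Illegal encrypted parameter use",
    0x89: "Operand length error",
    0x8A: "Farm memory too small for operation",
    0x8B: "Modular inverse does not exist",
    0x8C: "Operand value error",
    0x8D: "(Intermediate) Result value error",
    0xC0: "Memory deadlock error",
}


def describe_result_code(code):
    """Return (error_flag, description) for one result code byte.

    error_flag mirrors the highest bit of the byte; undefined codes are
    reported as 'Reserved'.
    """
    return bool(code & 0x80), RESULT_CODES.get(code, "Reserved")
```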
Command List
Table C-56 is the list of commands or functions. For information about command restrictions, see
“Restrictions on Input Vectors for PKCP Operations” on page 353.
Table C-56. Master Command List
Command (Description) | On page . . .
Add (Basic Arithmetic) | page 339
Subtract (Basic Arithmetic) | page 339
Add/Subtract Combination (Basic Arithmetic) | page 340
Multiply (Basic Arithmetic) | page 340
Divide (Basic Arithmetic) | page 341
Modulo (Basic Arithmetic) | page 341
Shift Left (Basic Arithmetic) | page 342
Shift Right (Basic Arithmetic) | page 342
Compare (Basic Arithmetic) | page 343
Copy (Basic Arithmetic) | page 343
Modular Exponentiation without CRT (Complex Arithmetic) | page 344
Modular Exponentiation with CRT (Complex Arithmetic) | page 345
Modular Inversion (Complex Arithmetic) | page 346
ECC Point Addition/Doubling (Complex Arithmetic) | page 346
ECC Point Multiplication (Complex Arithmetic) | page 347
ECDSA Signature Generation (High-Level PKA Operations) | page 348
ECDSA Signature Verification (High-Level PKA Operations) | page 349
DSA Signature Generation (High-Level PKA Operations) | page 350
AES Known Answer (Verify Operations) | page 352
Add (Basic Arithmetic)
Command Code: 0x01
Operation: A + B → C
Inputs: ‘A’ (length ‘A’, ≤ 130 words long), ‘B’ (length ‘B’, ≤ 130 words long)
Result: ‘C’ (max(length ‘A’, length ‘B’) + 1 word long)
Possible Errors:
• Illegal encrypted parameter use
• Operand length error
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
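A software reference model can be useful for checking the result length rule of the Add command. The following Python sketch is such a model, not a description of the hardware: all names are hypothetical, and storing vectors with the least significant 32-bit word first is an assumption that this section does not state.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1
MAX_WORDS = 130  # operand length limit for the Add command


def to_words(value, length):
    """Split an unsigned integer into `length` 32-bit words
    (least significant word first, an assumed ordering)."""
    if value < 0 or value >> (WORD_BITS * length):
        raise ValueError("value does not fit in the requested length")
    return [(value >> (WORD_BITS * i)) & WORD_MASK for i in range(length)]


def from_words(words):
    """Combine 32-bit words (least significant first) into an integer."""
    return sum(word << (WORD_BITS * i) for i, word in enumerate(words))


def add_reference(a_words, b_words):
    """Model of A + B -> C: the result is max(len A, len B) + 1 words."""
    if len(a_words) > MAX_WORDS or len(b_words) > MAX_WORDS:
        raise ValueError("operand length error")
    result_len = max(len(a_words), len(b_words)) + 1
    return to_words(from_words(a_words) + from_words(b_words), result_len)
```

The extra result word always has room for the carry, which is why the Host must reserve one word more than the longer operand for vector ‘C’.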
Subtract (Basic Arithmetic)
Command Code: 0x02
Operation: A − B → C
Inputs: ‘A’ (length ‘A’, ≤ 130 words long), ‘B’ (length ‘B’, ≤ 130 words long)
Result: ‘C’ (max(length ‘A’, length ‘B’) long)
Possible Errors:
• Illegal encrypted parameter use
• Operand length error
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
Add/Subtract Combination (Basic Arithmetic)
Command Code: 0x03
Operation: A + C − B → D
Inputs: ‘A’ (length ‘A’, ≤ 130 words long), ‘B’ (length ‘A’, ≤ 130 words long), ‘C’ (length ‘A’, ≤ 130 words long)
Result: ‘D’ (length ‘A’ + 1 word long)
Possible Errors:
• Illegal encrypted parameter use
• Operand length error
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset
Multiply (Basic Arithmetic)
Command Code: 0x04
Operation: A × B → C
Inputs: ‘A’ (length ‘A’, ≤ 130 words long), ‘B’ (length ‘B’, ≤ 130 words long)
Result: ‘C’ (length ‘A’ + length ‘B’ long)
Possible Errors:
• Illegal encrypted parameter use
• Operand length error
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
Divide (Basic Arithmetic)
Command Code: 0x05
Operation: A mod B → C, A div B → D
Inputs: ‘A’ (length ‘A’, must be ≥ length ‘B’, ≤ 130 words long), ‘B’ (length ‘B’, must be > 1, ≤ 130 words long)
Result: ‘C’ (length ‘B’ long), ‘D’ (length ‘A’ – length ‘B’ + 1 word long)
Possible Errors:
• Illegal encrypted parameter use.
• Operand length error.
• Modulus was too short.
Extra Status: Result (‘D’) = 0, Main result (‘D’) MSW offset, Modulo (‘C’) = 0, Modulo (‘C’) MSW offset
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
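The result lengths of the Divide command can be checked with a small reference model. The following Python sketch is an illustration under stated assumptions, not the hardware algorithm: the names are hypothetical, vectors are assumed to be stored least significant word first, and the most significant word of ‘B’ is assumed to be non-zero. Under that last assumption the quotient always fits in length ‘A’ – length ‘B’ + 1 words, and a ‘B’ of two or more words has at least 33 significant bits.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1
MAX_WORDS = 130


def to_words(value, length):
    """Split an unsigned integer into `length` 32-bit words
    (least significant word first, an assumed ordering)."""
    if value < 0 or value >> (WORD_BITS * length):
        raise ValueError("value does not fit in the requested length")
    return [(value >> (WORD_BITS * i)) & WORD_MASK for i in range(length)]


def from_words(words):
    """Combine 32-bit words (least significant first) into an integer."""
    return sum(word << (WORD_BITS * i) for i, word in enumerate(words))


def divide_reference(a_words, b_words):
    """Model of A mod B -> C, A div B -> D. Returns (C, D) as word lists."""
    len_a, len_b = len(a_words), len(b_words)
    if not 1 < len_b <= len_a <= MAX_WORDS:
        raise ValueError("operand length error")
    if b_words[-1] == 0:
        # Assumed restriction for this model only (see lead-in).
        raise ValueError("most significant word of B must be non-zero")
    quotient, remainder = divmod(from_words(a_words), from_words(b_words))
    return to_words(remainder, len_b), to_words(quotient, len_a - len_b + 1)
```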
Modulo (Basic Arithmetic)
Command Code: 0x06
Operation: A mod B → C
Inputs: ‘A’ (length ‘A’, must be ≥ length ‘B’, ≤ 130 words long), ‘B’ (length ‘B’, must be > 1, ≤ 130 words long)
Result: ‘C’ (length ‘B’ long)
Possible Errors:
• Illegal encrypted parameter use.
• Operand length error.
• Modulus was too short.
Extra Status: Modulo (‘C’) = 0, Modulo (‘C’) MSW offset
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
Shift Left (Basic Arithmetic)
Command Code: 0x07
Operation: A shl “shift count” → C
Inputs: ‘A’ (length ‘A’, ≤ 130 words long), shift count (range 0 … 31 bits)
Result: ‘C’ (if “shift count” = 0: length ‘A’ long, else length ‘A’ + 1 word long)
Possible Errors:
• Illegal encrypted parameter use
• Operand length error
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
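The conditional result length of Shift Left can be illustrated with a short reference model. The following Python sketch uses hypothetical names and assumes that vectors are stored least significant word first; it models only the arithmetic and the length rule given above.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1
MAX_WORDS = 130


def shift_left_reference(a_words, shift_count):
    """Model of A shl "shift count" -> C.

    The result is len(A) words for a shift count of 0 and len(A) + 1
    words otherwise (least significant word first, an assumed ordering).
    """
    if len(a_words) > MAX_WORDS:
        raise ValueError("operand length error")
    if not 0 <= shift_count <= 31:
        raise ValueError("shift count must be in the range 0..31")
    value = sum(w << (WORD_BITS * i) for i, w in enumerate(a_words))
    result_len = len(a_words) if shift_count == 0 else len(a_words) + 1
    shifted = value << shift_count
    return [(shifted >> (WORD_BITS * i)) & WORD_MASK
            for i in range(result_len)]
```

Because the shift count is at most 31 bits, one extra word is always enough to hold the bits shifted out of the top word.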
Shift Right (Basic Arithmetic)
Command Code: 0x08
Operation: A shr “shift count” → C
Inputs: ‘A’ (length ‘A’, ≤ 130 words long), shift count (range 0 … 31 bits)
Result: ‘C’ (length ‘A’ long)
Possible Errors:
• Illegal encrypted parameter use
• Operand length error
Extra Status: Result = 0, Main result MSW offset, Main result MS bit offset
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
Compare (Basic Arithmetic)
Command Code
0x09
Operation
compare values of ‘A’ and ‘B’
Inputs
‘A’
(length ‘A’, ≤ 130 words long),
‘B’
(length ‘A’, ≤ 130 words long)
Result
‘A’ = ‘B’ (Compare result = 001),
‘A’ < ‘B’ (Compare result = 010),
‘A’ > ‘B’ (Compare result = 100)
Possible Errors
• Illegal encrypted parameter use
• Operand length error
Extra Status
N/A
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
Copy (Basic Arithmetic)
Command Code
0x0A
Operation
A -> C
Inputs
‘A’
(length ‘A’, ≤ 255 words long)
Result
‘C’
(length ‘A’ long)
Possible Errors
• Illegal encrypted parameter use
• Operand length error
Extra Status
Result = 0, Main result MSW offset, Main result MS bit offset
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
Modular Exponentiation without CRT (Complex Arithmetic)
Command Code
0x10
Operation
C^A mod B -> D
Inputs
‘A’
(length ‘A’, which must be in range 1 … 130 words), can be an encrypted vector,
‘B’
(length ‘B’, which must be in range 2 … 130 words), can be an encrypted vector,
‘C’
(length ‘B’).
Number of ‘odd powers’ (in the range 1 … 16).
Result
‘D’
(length ‘B’ + 1 word long)
Possible Errors
•
•
•
•
•
•
•
•
Extra Status
Result = 0, Main result MSW offset, Main result MS bit offset
Illegal encrypted parameter use
Operand length error
Modulus is even
Modulus too short
Exponent was 0
Exponent was 1
Odd powers out-of-range
Farm memory too small (probably due to too high odd powers setting)
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
Modular Exponentiation with CRT (Complex Arithmetic)
Command Code
0x11
Operation
(Input mod Mod_P)^Exp_P mod Mod_P -> X,
(Input mod Mod_Q)^Exp_Q mod Mod_Q -> Y,
((((X – Y) mod Mod_P) × Q_inv) mod Mod_P) × Mod_Q -> Z,
Y + Z -> D
Inputs
‘A’
points to Exp_P followed by Exp_Q with possibly one buffer word (note a) in between (both
length ‘A’, which must be in range 1 … 66 words), can be an encrypted vector,
‘B’
points to Mod_P followed by Mod_Q with one or two buffer words (note b) in between
(both length ‘B’, which must be in range 2 … 66 words), can be an encrypted vector,
‘C’
points to Q_inv (length ‘B’), can be an encrypted vector,
‘E’
points to Input (2 × length ‘B’ long).
The number of ‘odd powers’ (in the range 1 … 16).
Result
‘D’
(2 × length ‘B’ long)
Possible Errors
• Illegal encrypted parameter use
• Operand length error
• Modulus is even
• Modulus too short
• Exponent was 0
• Exponent was 1
• Odd powers out-of-range
• Farm memory too small (probably due to too high odd powers setting)
Extra Status
Result = 0, Main result MSW offset, Main result MS bit offset
a. The buffer word is inserted when length ‘A’ is odd.
b. One buffer word when length ‘B’ is odd, two buffer words when length ‘B’ is even.
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
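The buffer-word rules in notes a and b determine where the second sub-vector of ‘A’ and ‘B’ starts. The following is a minimal sketch of that arithmetic, assuming offsets are counted in 32-bit words from the start of each vector; the function name `crt_vector_offsets` is hypothetical and not part of any Tilera API.

```python
def crt_vector_offsets(a_len: int, b_len: int) -> dict:
    """Word offsets of the sub-vectors inside the 'A' and 'B' input vectors.

    Hypothetical helper: it only restates notes a and b as arithmetic.
    """
    if not 1 <= a_len <= 66:
        raise ValueError("length 'A' must be in range 1 ... 66 words")
    if not 2 <= b_len <= 66:
        raise ValueError("length 'B' must be in range 2 ... 66 words")
    # Note a: one buffer word between Exp_P and Exp_Q only when length 'A' is odd.
    exp_gap = a_len % 2
    # Note b: one buffer word when length 'B' is odd, two when it is even.
    mod_gap = 1 if b_len % 2 else 2
    return {
        "exp_p": 0,
        "exp_q": a_len + exp_gap,
        "mod_p": 0,
        "mod_q": b_len + mod_gap,
    }
```

With these rules the second sub-vector always starts at an even word offset relative to the start of its vector.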
Modular Inversion (Complex Arithmetic)
Command Code
0x12
Operation
A^-1 mod B -> D
Inputs
‘A’
(length ‘A’, ≤ 130 words long),
‘B’
(length ‘B’, ≤ 130 words long)
Result
‘D’
(length ‘B’ long)
Possible Errors
• Illegal encrypted parameter use
• Operand length error
• Modulus is even
• No inverse exists (GCD (A, B) ≠ 1)
Extra Status
Result = 0, Main result MSW offset, Main result MS bit offset
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
ECC Point Addition/Doubling (Complex Arithmetic)
Command Code
0x14
Operation
Point addition (Pnt_A ≠ Pnt_C) or point doubling (Pnt_A = Pnt_C) on elliptic curve
y^2 = x^3 + ax + b (mod p),
Pnt_A + Pnt_C -> Pnt_D
Inputs
‘A’
points to Pnt_A.x followed by Pnt_A.y with two or three buffer words (note a) in between (both
length ‘B’, which must be in range 2 … 24 words),
‘B’
points to modulus p followed by curve parameter a with two or three buffer words (note a) in
between (both length ‘B’, curve parameter b is not used here), can be an encrypted
vector,
‘C’
points to Pnt_C.x followed by Pnt_C.y with two or three buffer words (note a) in between (both
length ‘B’), can be an encrypted vector
Result
‘D’
points to Pnt_D.x followed by Pnt_D.y with two or three buffer words (note a) in between (both
length ‘B’)
Possible Errors
• Illegal encrypted parameter use
• Operand length error
• Modulus is even
• Result point at infinity
Extra Status
Result = 0, Main result MSW offset, Main result MS bit offset.
Note that this information refers to Pnt_D.x only.
a. Two buffer words when length ‘B’ is even, three buffer words when length ‘B’ is odd.
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
ECC Point Multiplication (Complex Arithmetic)
Command Code
0x15
Operation
Point multiplication on elliptic curve y^2 = x^3 + ax + b (mod p), k × Pnt_C -> Pnt_D
Inputs
‘A’
points to scalar multiplication value k (length ‘A’, must be in range 1 … 24 words), can be
an encrypted vector.
‘B’
points to modulus p followed by curve parameters a and b with two or three buffer words (note a)
in between (all of length ‘B’, which must be in range 2 … 24 words), can be an
encrypted vector.
‘C’
points to Pnt_C.x followed by Pnt_C.y with two or three buffer words in between (both
length ‘B’), can be an encrypted vector.
Result
‘D’
points to Pnt_D.x followed by Pnt_D.y with two or three buffer words in between (both
length ‘B’).
Possible Errors
• Illegal encrypted parameter use
• Operand length error
• Modulus is even
• Modulus too short
• Result point at infinity
Extra Status
Result = 0, Main result MSW offset, Main result MS bit offset.
Note that this information refers to Pnt_D.x only.
a. Two buffer words when length ‘B’ is even, three buffer words when length ‘B’ is odd.
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
ECDSA Signature Generation (High-Level PKA Operations)
Command Code
0x20
Operation
Generate r and s values of an ECDSA signature using elliptic curve
y^2 = x^3 + ax + b (mod p, subgroup size n), base point Pnt_C, random value k, private key Alpha
and message digest h:
1. Check that k is in range 1 … n–1 (if not, return an operand value error).
2. Calculate r = x1 (mod n), where (x1,y1) = k·Pnt_C (an ECC point multiply, see note a).
3. If r equals zero, return with a result value error (must re-try with different k).
4. Calculate s = k^-1·(h + r·Alpha) (mod n) (k^-1 is a modular inversion).
5. If s equals zero, return with a result value error (must re-try with different k).
Inputs
‘A’
points to private key Alpha (length ‘B’, which must be in range 2 … 24), can be an
encrypted vector.
‘B’
points to modulus p followed by curve parameters a, b, n and base point coordinates
Pnt_C.x followed by Pnt_C.y (all of length ‘B’), with two or three buffer words (note b) between
all sub-vectors, can be an encrypted vector.
‘C’
points to message digest h (length ‘B’).
‘E’
points to random value k (length ‘B’).
Result
‘D’
points to r followed by s with two or three buffer words in between (both length ‘B’).
Note that r and s must be non-zero; the algorithm must be re-run with a new k value if
either ends up being zero (indicated by a result value error).
Possible Errors
• Illegal encrypted parameter use
• Operand length error
• Operand value error
• Modulus is even
• Modulus too short
• No inverse exists
• Result point at infinity
• Result value error
Extra Status
N/A
a. Only the x-coordinate of the result is used, so the y-coordinate is not calculated.
b. Two buffer words when length ‘B’ is even, three buffer words when length ‘B’ is odd.
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
ECDSA Signature Verification (High-Level PKA Operations)
Command Code
0x21 (With r” Write-Back), 0x25 (Without r” Write-Back)
Operation
Generate r” value (and r” = r’ indication) of an ECDSA verify using elliptic curve
y^2 = x^3 + ax + b (mod p, subgroup size n), base point Pnt_C, public key point Pnt_A, message
digest h and ECDSA signature values r’ and s’.
1. If r’ or s’ are outside the range 1 … n–1, return with an operand value error.
2. Calculate w = s’^-1 (mod n) (s’^-1 is a modular inversion).
3. Calculate u1 = h·w (mod n) and u2 = r’·w (mod n).
4. Calculate (x1,y1) = u1·Pnt_C + u2·Pnt_A (two point multiplies and one point add).
5. Calculate r” = x1 (mod n) and compare this value to the given r’.
Inputs
‘A’
points to public key Pnt_A.x followed by Pnt_A.y with two or three buffer words (note a)
in between (both length ‘B’, which must be in range 2 … 24)
‘B’
points to modulus p followed by curve parameters a, b, n and base point coordinates
Pnt_C.x followed by Pnt_C.y (all of length ‘B’) with two or three buffer words (note a) between all
sub-vectors, can be an encrypted vector.
‘C’
points to message digest h (length ‘B’).
‘E’
points to r’ followed by s’ with two or three buffer words (note a) in between (both length ‘B’).
Result
‘D’
points to r” (length ‘B’), which is written back for external verification of a match
only when command code 0x21 is used,
r” = r’ (Compare result = 001),
r” < r’ (Compare result = 010),
r” > r’ (Compare result = 100).
Possible Errors
• Illegal encrypted parameter use
• Operand length error
• Operand value error
• Modulus is even
• Modulus too short
• No inverse exists
• Result point at infinity
Extra Status
N/A
a. Two buffer words when length ‘B’ is even, three buffer words when length ‘B’ is odd.
Note: The ‘E’ pointer to r’ and s’ is not present in the result descriptor. A driver is forced to keep
track of the pointer to r’ if external comparison with r” is required.
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
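The ECDSA signature generation and verification steps listed for command codes 0x20 and 0x21/0x25 can be followed on a toy curve. The sketch below models only the mathematics, not the PKA descriptor interface. The curve y^2 = x^3 + 2x + 2 (mod 17) with base point (5, 1) of order 19 is a textbook example chosen for illustration, and all function names are hypothetical.

```python
# Toy parameters (assumption, for illustration only): y^2 = x^3 + 2x + 2 (mod 17),
# base point Pnt_C = (5, 1) with subgroup size n = 19.
P, A, N = 17, 2, 19
PNT_C = (5, 1)

def point_add(p1, p2):
    """Add two affine points; None stands for the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None                      # p2 = -p1 (or doubling a point with y = 0)
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P, -1, P) % P
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)

def point_mul(k, pt):
    """Double-and-add point multiplication, k * pt."""
    result = None
    while k > 0:
        if k & 1:
            result = point_add(result, pt)
        pt = point_add(pt, pt)
        k >>= 1
    return result

def ecdsa_sign(alpha, h, k):
    if not 1 <= k <= N - 1:
        raise ValueError("operand value error")        # step 1
    r = point_mul(k, PNT_C)[0] % N                     # step 2
    if r == 0:
        raise ValueError("result value error")         # step 3
    s = pow(k, -1, N) * (h + r * alpha) % N            # step 4
    if s == 0:
        raise ValueError("result value error")         # step 5
    return r, s

def ecdsa_verify(pnt_a, h, r, s):
    if not (1 <= r <= N - 1 and 1 <= s <= N - 1):
        raise ValueError("operand value error")        # step 1
    w = pow(s, -1, N)                                  # step 2
    u1, u2 = h * w % N, r * w % N                      # step 3
    pt = point_add(point_mul(u1, PNT_C), point_mul(u2, pnt_a))  # step 4
    if pt is None:
        raise ValueError("result point at infinity")
    return pt[0] % N == r                              # step 5: r'' compared with r'
```

On a curve this small, some k values give r = 0 or s = 0, so the retry behavior described for signature generation is easy to observe.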
DSA Signature Generation (High-Level PKA Operations)
Command Code
0x22
Operation
Generate r and s values of a DSA signature using prime p, sub-prime n, value g (g’s multiplicative
order modulo p equals q), random value k, private key Alpha and message digest h (the number
of significant bits in n should equal the length of h):
1. Check that k is in range 1 … n–1 (if not, return an operand value error).
2. Calculate r = ((g^k mod p) mod n).
3. Calculate s = k^-1·(h + r·Alpha) (mod n) (k^-1 is a modular inversion).
4. If r or s equals zero, return with a result value error (must be re-run).
Inputs
‘A’
points to private key Alpha (length ‘B’, which must be in range 2 … length ‘A’ words), can
be an encrypted vector.
‘B’
points to prime p followed by value g (both length ‘A’, which must be in range 2 … 130
words), followed by sub-prime n (length ‘B’), with two or three buffer words (note a) between all
sub-vectors, can be an encrypted vector.
‘C’
points to message digest h (length ‘B’).
‘E’
points to random value k (length ‘B’).
Number of ‘odd powers’ (in the range 1 … 16).
Result
‘D’
points to r followed by s with two or three buffer words (note b) in between (both length ‘B’).
Note that r and s must be non-zero; the algorithm must be re-run with a new k value if
either ends up being zero (indicated by returning a result value error).
Possible Errors
• Illegal encrypted parameter use
• Operand length error
• Operand value error
• Modulus is even
• Modulus too short
• No inverse exists
• Odd powers out-of-range
• Result value error
• Farm memory too small (probably due to too high odd powers setting)
Extra Status
N/A
a. Two buffer words when length ‘A’ is even, three buffer words when length ‘A’ is odd.
b. Two buffer words when length ‘B’ is even, three buffer words when length ‘B’ is odd.
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
DSA Signature Verification (High-Level PKA Operations)
Command Code
0x23 (with r” write-back), 0x27 (without r” write-back)
Operation
Generate r” value (and r” = r’ indication) of a DSA verify using prime p, sub-prime n, value g (g’s
multiplicative order modulo p equals q), public key y, message digest h (the number of significant
bits in n should equal the length of h) and DSA signature values r’ and s’.
1. If r’ or s’ are outside the range 1 … n–1, return with an operand value error.
2. Calculate w = s’^-1 (mod n) (s’^-1 is a modular inversion).
3. Calculate u1 = h·w (mod n) and u2 = r’·w (mod n).
4. Calculate r” = (((g^u1·y^u2) mod p) mod n) and compare this value to the given r’.
Inputs
‘A’
points to public key y (length ‘A’, which must be in range 2 … 130 words),
‘B’
points to prime p followed by value g (both length ‘A’), followed by sub-prime n (length
‘B’, which must be in range 2 … length ‘A’ words), with two or three buffer words (note a)
between all sub-vectors, can be an encrypted vector,
‘C’
points to message digest h (length ‘B’),
‘E’
points to r’ followed by s’ with two or three buffer words (note b) in between (both length ‘B’).
Number of ‘odd powers’ (in the range 1 … 16).
Result
‘D’
points to r” (length ‘B’), which is written back for external verification of a match
only when command code 0x23 is used,
r” = r’ (Compare result = 001),
r” < r’ (Compare result = 010),
r” > r’ (Compare result = 100).
Possible Errors
• Illegal encrypted parameter use
• Operand length error
• Operand value error
• Modulus is even
• Modulus too short
• No inverse exists
• Odd powers out-of-range
• Farm memory too small (probably due to too high odd powers setting)
Extra Status
N/A
a. Two buffer words when length ‘A’ is even, three buffer words when length ‘A’ is odd.
b. Two buffer words when length ‘B’ is even, three buffer words when length ‘B’ is odd.
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
Note: The ‘E’ pointer to r’ and s’ is not present in the result descriptor. A driver is forced to keep
track of the pointer to r’ if external comparison with r” is required.
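The DSA signature generation and verification steps above reduce to a few modular exponentiations and one modular inversion. The following sketch models only that arithmetic with deliberately tiny parameters (p = 23, n = 11, g = 4, chosen for illustration and far too small for real use); the function names are hypothetical and do not correspond to a driver API.

```python
# Toy parameters (assumption): prime p = 23, sub-prime n = 11, and g = 4,
# whose multiplicative order modulo p is 11.
P, N, G = 23, 11, 4

def dsa_sign(alpha, h, k):
    """Return (r, s) following the signature generation steps."""
    if not 1 <= k <= N - 1:
        raise ValueError("operand value error")          # step 1
    r = pow(G, k, P) % N                                 # step 2
    s = pow(k, -1, N) * (h + r * alpha) % N              # step 3
    if r == 0 or s == 0:
        raise ValueError("result value error")           # step 4: re-run with new k
    return r, s

def dsa_verify(y, h, r, s):
    """Return True when the recomputed r'' equals the given r'."""
    if not (1 <= r <= N - 1 and 1 <= s <= N - 1):
        raise ValueError("operand value error")          # step 1
    w = pow(s, -1, N)                                    # step 2
    u1, u2 = h * w % N, r * w % N                        # step 3
    r2 = pow(G, u1, P) * pow(y, u2, P) % P % N           # step 4
    return r2 == r
```

Here the public key is y = g^Alpha mod p. The hardware additionally reports whether r” is less than or greater than r’; the sketch returns only the equality result.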
Diffie-Hellman Key Exchanges
The standard Diffie-Hellman key exchange (DH), based on modular exponentiation, uses only basic
modular exponentiations: two on each side, one to generate the value sent to the other side from a
chosen local secret and one to calculate the shared secret from the value received from the other
side. The modular exponentiation command described in “Modular Exponentiation without CRT (Complex Arithmetic)” can be used for these operations.
The Elliptic Curve based Diffie-Hellman key exchange (ECDH) is similar in that it uses only basic ECC
point multiplications (in its simplest form, multiply the other side’s public key point by
the local private key scalar and use the result point’s x-coordinate as the shared secret). The ECC
point multiplication command described in “ECC Point Multiplication (Complex Arithmetic)” can be used for these operations.
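As an illustration of the two exponentiations per side, the sketch below uses Python's built-in `pow` as a stand-in for the modular exponentiation command (0x10). The group parameters p = 23 and g = 5 are a toy assumption, and the helper names are hypothetical.

```python
def modexp(base: int, exponent: int, modulus: int) -> int:
    """Stand-in for the PKA modular exponentiation command: base^exponent mod modulus."""
    return pow(base, exponent, modulus)

def dh_public_value(g: int, secret: int, p: int) -> int:
    """First exponentiation: the value sent to the other side."""
    return modexp(g, secret, p)

def dh_shared_secret(received: int, secret: int, p: int) -> int:
    """Second exponentiation: the shared secret from the received value."""
    return modexp(received, secret, p)

# Toy parameters (assumption, far too small for real use).
p, g = 23, 5
secret_a, secret_b = 6, 15
value_a = dh_public_value(g, secret_a, p)      # sent from side A to side B
value_b = dh_public_value(g, secret_b, p)      # sent from side B to side A
shared_a = dh_shared_secret(value_b, secret_a, p)
shared_b = dh_shared_secret(value_a, secret_b, p)
```

Both sides end up with the same value because (g^a)^b and (g^b)^a are equal modulo p.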
AES Known Answer (Verify Operations)
Command Code
0xE0
Operation
Performs an AES operation with the given IV/Key/control/increment information on the given data.
Inputs
‘A’
Points to the IV/Key/control/increment information (length ‘A’, which must be exactly
14 words); see “PKI Key Decrypt Key Management Interface” for the exact information
layout.
‘B’
Points to the data to decrypt or encrypt (length ‘B’, which must be in range 4 … 508
words and always a multiple of 4 words).
Result
‘C’
Points to the decrypted or encrypted data (length ‘B’).
Possible Errors
• Illegal encrypted parameter use
• Operand length error
Extra Status
N/A
Additional information on “Restrictions on Input Vectors for PKCP Operations” on page 353,
“PKCP Result Vector Memory Allocation” on page 353, and “PKCP Result Vector / Input Vector
Overlap Restrictions” on page 354.
C.5.5.4 Restrictions on PKA Operations
Note: Failure to comply with these restrictions will result in an incorrect mathematical result, but
not necessarily an error code in the result descriptor.
Restrictions on Input Vectors for PKCP Operations
Table C-57. Operational Restrictions
Function
Requirements
Multiply
0 < A_Len, B_Len < Max_Len
Add
0 < A_Len, B_Len < Max_Len
Subtract
0 < A_Len, B_Len < Max_Len Result must be positive (A > B)
AddSub
0 < A_Len < Max_Len (B and C operands have A_Len as length, B_Len ignored)
Result must be positive ((A + C) > B)
Right Shift
0 < A_Len < Max_Len
Left Shift
0 < A_Len < Max_Len
Divide, Modulo
1 < B_Len < A_Len < Max_Len
Most significant 32-bit word of B operand cannot be zero
Compare
0 < A_Len < Max_Len (B operand has A_Len as length, B_Len ignored)
Copy
0 < A_Len < Max_Len
PKCP Result Vector Memory Allocation
The host is responsible for allocating a block of contiguous memory in PKA RAM for the result
vector(s). Table C-58 indicates how much memory should be allocated for the result vector(s).
Table C-58. Result Vector Memory Allocation
Function
Result Vector
Result Vector Length (in 32-bit words)
Multiply
C
A_Len + B_Len + 6 (the 6 ‘scratchpad’ words should be discarded)
Add
C
Max(A_Len, B_Len) + 1
Subtract
C
Max(A_Len, B_Len)
AddSub
D
A_Len + 1
Right Shift
C
A_Len
Left Shift
C
A_Len + 1 (when Shift Value is non-zero)
A_Len (when Shift Value is zero)
Divide
C
Remainder -> B_Len + 1 (one ‘scratchpad’ word should be discarded)
D
Quotient -> A_Len - B_Len + 1
Modulo
C
Remainder -> B_Len + 1
Compare
None
Compare updates the PKA_COMPARE register
Copy
C
A_Len
Input vectors for an operation are always allowed to overlap in memory (partially or completely).
Table C-59 identifies restrictions for the overlap of output and input vectors of the operations.
PKCP Result Vector / Input Vector Overlap Restrictions
Table C-59. Result Vector / Input Vector Overlap Restrictions
Function
Result Vector
Restrictions
Multiply
C
No overlap with A or B vectors allowed.
Add, Subtract
C
May overlap with A and/or B vector, provided the start address of the C vector
does not lie above the start address of the vector(s) with which it overlaps.
AddSub
D
May overlap with A, B and/or C vector, provided the start address of the D vector
does not lie above the start address of the vector(s) with which it overlaps.
Right Shift, Left Shift
C
May overlap with A vector, provided the start address of the C vector does not lie
above the start address of the A vector.
Divide
C
No overlap with A, B or D vectors allowed.
D
No overlap with A, B or C vectors allowed.
Modulo
C
No overlap with A or B vectors allowed.
Compare
None
Compare does not write a result vector.
Copy
C
Same restrictions as for Right/Left Shift, copy of a vector to a lower address is
always allowed even if source and destination overlap.
PKCP Operations
Table C-60 lists the arguments and results for each PKCP operation.
Table C-60. Summary of PKCP Vector Operations
Function
Mathematical
Operation
Vector A
Vector B
Vector C
Vector D
Multiply
A x B -> C
Multiplicand
Multiplier
Product
N/A
Add
A + B -> C
Addend
Addend
Sum
N/A
Subtract
A - B -> C
Minuend
Subtrahend
Difference
N/A
AddSub
A + C - B -> D
Addend
Subtrahend
Addend
Result
Right Shift
A >> Shift -> C
Input
N/A
Result
N/A
Left Shift
A << Shift -> C
Input
N/A
Result
N/A
Divide
A mod B -> C,
A div B -> D
Dividend
Divisor
Remainder
Quotient
Modulo
A mod B -> C
Dividend
Divisor
Remainder
N/A
Compare
A = B, A < B, A > B
Input1
Input2
N/A
N/A
Copy
A -> C
Input
N/A
Result
N/A
To obtain a correct result, the input vectors must meet the requirements presented in Table C-61.
Note that:
• Input restrictions are not checked by the PKCP
• A_Len and B_Len indicate the size of vectors A and B in (32-bit) words
• Max_Len equals 64 (32-bit) words, i.e. the standard maximum vector size is 2048 bits
Note: Maximum vector sizes can be optionally extended to 4096 or 8192 bits (with Max_Len equal
to 128 or 256, respectively).
Table C-61. Restrictions on Input Vectors for PKCP Operations
Operational Restrictions
Function
Requirements
Multiply
0 < A_Len, B_Len < Max_Len
Add
0 < A_Len, B_Len < Max_Len
Subtract
0 < A_Len, B_Len < Max_Len Result must be positive (A > B)
AddSub
0 < A_Len < Max_Len (B and C operands have A_Len as length, B_Len ignored)
Result must be positive ((A + C) > B)
Right Shift
0 < A_Len < Max_Len
Left Shift
0 < A_Len < Max_Len
Divide, Modulo
1 < B_Len < A_Len < Max_Len
The most significant 32-bit word of B operand cannot be zero.
Compare
0 < A_Len < Max_Len (B operand has A_Len as length, B_Len ignored)
Copy
0 < A_Len < Max_Len
The host is responsible for allocating a block of contiguous memory in PKA RAM for the result
vector(s). Table C-62 indicates how much memory should be allocated for the result vector(s).
Table C-62. PKCP Result Vector Memory Allocation
Result Vector Memory Allocation
Function
Result Vector
Result Vector Length (in 32-bit words)
Multiply
C
A_Len + B_Len + 6 (the 6 ‘scratchpad’ words should be discarded)
Add
C
Max(A_Len, B_Len) + 1
Subtract
C
Max(A_Len, B_Len)
AddSub
D
A_Len + 1
Right Shift
C
A_Len
Left Shift
C
A_Len + 1 (when Shift Value is non-zero)
A_Len (when Shift Value is zero)
Divide
C
Remainder -> B_Len + 1 (one ‘scratchpad’ word should be discarded)
D
Quotient -> A_Len - B_Len + 1
Modulo
C
Remainder -> B_Len + 1
Compare
None
Compare updates the PKA_COMPARE register
Copy
C
A_Len
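The allocation rules in the table can be collected into one helper that a host could call before reserving PKA RAM. This is a minimal sketch that only restates the table; the function name, the operation strings and the dictionary return format are assumptions made for illustration.

```python
def result_vector_words(function: str, a_len: int, b_len: int = 0, shift: int = 0) -> dict:
    """Return {result vector name: length in 32-bit words} for a basic PKCP operation."""
    if function == "Multiply":
        return {"C": a_len + b_len + 6}          # includes 6 scratchpad words
    if function == "Add":
        return {"C": max(a_len, b_len) + 1}
    if function == "Subtract":
        return {"C": max(a_len, b_len)}
    if function == "AddSub":
        return {"D": a_len + 1}
    if function == "Right Shift":
        return {"C": a_len}
    if function == "Left Shift":
        return {"C": a_len + 1 if shift else a_len}
    if function == "Divide":
        # Remainder in C (one scratchpad word included), quotient in D.
        return {"C": b_len + 1, "D": a_len - b_len + 1}
    if function == "Modulo":
        return {"C": b_len + 1}
    if function == "Compare":
        return {}                                # only PKA_COMPARE is updated
    if function == "Copy":
        return {"C": a_len}
    raise ValueError(f"unknown PKCP function: {function}")
```

For example, dividing a 64-word vector by a 32-word vector calls for 33 words at C and 33 words at D.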
Input vectors for an operation are always allowed to overlap in memory (partially or completely).
Table C-63 gives restrictions for the overlap of output and input vectors of the operations.
Table C-63. PKCP Result Vector / Input Vector Overlap Restrictions
Result Vector / Input Vector Overlap Restrictions
Function
Result Vector
Restrictions
Multiply
C
No overlap with A or B vectors allowed.
Add, Subtract
C
May overlap with A and/or B vector, provided the start address of the C vector does not lie above the start address of the vector(s) with which it overlaps.
AddSub
D
May overlap with A, B and/or C vector, provided the start address of the D
vector does not lie above the start address of the vector(s) with which it overlaps.
Right Shift, Left Shift
C
May overlap with A vector, provided the start address of the C vector does
not lie above the start address of the A vector.
Divide
C
No overlap with A, B or D vectors allowed.
D
No overlap with A, B or C vectors allowed.
Modulo
C
No overlap with A or B vectors allowed.
Compare
None
Compare does not write a result vector.
Copy
C
Same restrictions as for Right/Left Shift, copy of a vector to a lower address
is always allowed even if source and destination overlap. (note a)
a. The Copy operation can be used to fill memory by breaking the overlap restrictions, but requires TWO initial (32-bit)
words to be set up: To zero a block of memory, set A vector pointer to the block start, set C vector pointer two words
higher and A vector length to the block length minus two (words). Fill the first two words of the block with constant
zero and perform a PKCP Copy operation to zero the remainder of the block.
The Sequencer controls modular exponentiation operations. This document assumes that the
Sequencer program ROM/RAM holds code that implements the following modular exponentiation (ExpMod) operations (using the LNME for most of the work if one is available).
Table C-64. Summary of ExpMod Operations
Function
Mathematical Operation
Vector A
Vector B
Vector C
Vector D
ExpMod-ACT2, ExpMod-ACT4, ExpMod-variable
C^A mod B -> D
Exponent, length = A_Len
Modulus, length = B_Len
Base, length = B_Len
Result & Workspace
ExpMod-CRT
See below
Exp P followed by Exp Q at next higher even word address (note a), both A_Len long
Mod P + buffer word followed by Mod Q at next higher even word address (note b), both B_Len long
Q inverse, length = B_Len
Input, Result (both 2 x B_Len long) & Workspace
a. If A_Len is even, Exp Q follows Exp P immediately – if A_Len is odd, there is one empty word between Exp Q and
Exp P.
b. If B_Len is even, there are two empty words between Mod P and Mod Q – if B_Len is odd, there is one empty (buffer)
word between Mod Q and Mod P. Note that the words following Mod P and Mod Q may be zeroed by Sequencer
firmware.
The ExpMod-CRT operation performs the following computation steps (note 1):
• X <- (Input mod Mod P)^(Exp P) mod Mod P
• Y <- (Input mod Mod Q)^(Exp Q) mod Mod Q
• Z <- ((((X – Y) mod Mod P) * Q inverse) mod Mod P) * Mod Q
• Result <- Y + Z
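The four steps above can be checked against a plain modular exponentiation. The sketch below is a software model of the arithmetic only, with hypothetical names. The RSA-style numbers (Mod P = 61, Mod Q = 53, private exponent 2753) are a small textbook example used as an assumption, with Exp P and Exp Q taken as the private exponent reduced modulo (Mod P - 1) and (Mod Q - 1).

```python
def expmod_crt(inp: int, exp_p: int, exp_q: int,
               mod_p: int, mod_q: int, q_inv: int) -> int:
    """Model of the ExpMod-CRT computation steps."""
    x = pow(inp % mod_p, exp_p, mod_p)                      # X
    y = pow(inp % mod_q, exp_q, mod_q)                      # Y
    z = ((((x - y) % mod_p) * q_inv) % mod_p) * mod_q       # Z
    return y + z                                            # Result

# Toy example (assumption): Mod P > Mod Q, both odd and co-prime.
mod_p, mod_q = 61, 53
d = 2753                                   # private exponent for e = 17
exp_p, exp_q = d % (mod_p - 1), d % (mod_q - 1)
q_inv = pow(mod_q, -1, mod_p)              # (Q inverse * Mod Q) = 1 (modulo Mod P)
ciphertext = 2790
plaintext = expmod_crt(ciphertext, exp_p, exp_q, mod_p, mod_q, q_inv)
```

The result equals `pow(ciphertext, d, mod_p * mod_q)`, while each exponentiation works on operands of roughly half the length.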
The ExpMod-ACT2, -ACT4 and -variable functions implement the same mathematical operation but with a differently sized table with pre-calculated ‘odd powers’. The ExpMod-ACT2
function uses a table with two entries whereas ExpMod-ACT4 uses a table with eight entries. The
ACT4 version gives better performance but needs more memory.
ExpMod-variable and ExpMod-CRT allow a variable number (from 1 up to and including 16) of
odd powers to be selected via the register normally used to specify the number of bits to shift for
shift operations.
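The role of the ‘odd powers’ table can be illustrated with a generic sliding-window exponentiation. The Sequencer firmware algorithm itself is not described here, so the sketch below is an assumption about the general technique rather than a model of the firmware, and the function name is hypothetical. It precomputes base^1, base^3, … up to the selected number of odd powers and consumes the exponent in windows whose value fits the table.

```python
def expmod_odd_powers(base: int, exponent: int, modulus: int, odd_powers: int) -> int:
    """Sliding-window base^exponent mod modulus using a table of odd powers."""
    if not 1 <= odd_powers <= 16:
        raise ValueError("odd powers out-of-range")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    square = base * base % modulus
    table = [base % modulus]                    # table[i] = base^(2i+1) mod modulus
    for _ in range(odd_powers - 1):
        table.append(table[-1] * square % modulus)
    largest = 2 * odd_powers - 1                # largest odd power in the table
    bits = bin(exponent)[2:]                    # most significant bit first
    result = 1 % modulus
    i = 0
    while i < len(bits):
        if bits[i] == "0":
            result = result * result % modulus
            i += 1
            continue
        # Grow the window while its value fits the table; it must end on a 1 bit.
        end, value = i + 1, 1
        for k in range(i + 1, len(bits)):
            value = 2 * value + int(bits[k])
            if value > largest:
                break
            if bits[k] == "1":
                end = k + 1
        window = int(bits[i:end], 2)            # odd by construction
        for _ in range(end - i):
            result = result * result % modulus
        result = result * table[(window - 1) // 2] % modulus
        i = end
    return result
```

A larger table reduces the number of multiplications per exponent bit at the cost of more memory, which matches the trade-off stated for ExpMod-ACT2 versus ExpMod-ACT4.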
For a user of the PKA Engine, the exponentiation functions appear to be extensions of the set of
PKCP functions as described in “PKCP Operations” on page 354. Input and result vectors are
passed in the same way as for basic PKCP operations. Table C-65 shows the restrictions on the
input and result vectors for the exponentiation operations.
Table C-65. Restrictions on Input Vectors for ExpMod Operations
Operational Restrictions
Function
Requirements
ExpMod-ACT2, ExpMod-ACT4, ExpMod-variable
1) 0 < A_Len < Max_Len
2) 1 < B_Len < Max_Len
3) Modulus B must be odd (i.e. the least significant bit must be ONE)
4) Modulus B > 2^32
5) Base C < Modulus B
6) Vectors B and C must be followed by an empty 32-bit ‘buffer’ word
ExpMod-CRT
1) 0 < A_Len < Max_Len
2) 1 < B_Len < Max_Len
3) Mod P and Mod Q must be odd (i.e. the least significant bits must be ONE)
4) Mod P > Mod Q > 2^32
5) Mod P and Mod Q must be co-prime (their GCD must be 1)
6) 0 < Exp P < (Mod P - 1)
7) 0 < Exp Q < (Mod Q - 1)
8) (Q inverse * Mod Q) = 1 (modulo Mod P)
9) Input < (Mod P * Mod Q)
10) Mod P and Mod Q must be followed by an empty 32-bit ‘buffer’ word
1. These steps implement Garner’s recombination algorithm after the basic exponentiations.
Table C-66 shows the required scratchpad sizes for the exponentiation operations – these depend
upon the PKA Engine type6. The M_Len used in the table is the ‘real’ Modulus length (for Mod P
in an ExpMod-CRT operation, for Modulus B in the other operations) in 32-bit words, i.e. without
trailing zero words at the end. If the last word of the modulus vector as given is non-zero, M_Len
equals B_Len.
Table C-66. ExpMod Result Vector/Scratchpad Area Memory Allocation
Result Vector/Scratchpad Area Memory Allocation (both starting at PKA_DPTR)
Function
PKA Engine Type
Scratchpad Area Size (in 32-bit Words), Result Vector is either M_Len or 2xM_Len 32-bit Words Long
ExpMod-ACT2
With LNME
3 x (M_Len + 2 – (M_Len MOD 2)) + 10
PKCP-only
5 x (M_Len + 2)
ExpMod-ACT4
With LNME
9 x (M_Len + 2 – (M_Len MOD 2))
PKCP-only
11 x (M_Len + 2)
ExpMod-variable
With LNME
Maximum of 3 x (M_Len + 2 – (M_Len MOD 2)) + 10 and
(# odd powers + 1) x (M_Len + 2 – (M_Len MOD 2))
PKCP-only
(# odd powers + 3) x (M_Len + 2)
ExpMod-CRT
With LNME
Maximum of 4 x (M_Len + 2 – (M_Len MOD 2)) + 10 and
(# odd powers + 2) x (M_Len + 2 – (M_Len MOD 2))
PKCP-only
(# odd powers + 3) x (M_Len + 2) + (M_Len + 2 – (M_Len MOD 2))
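The scratchpad formulas of Table C-66 can be evaluated with a small helper. This sketch assumes that the formulas pair with the functions in the order ExpMod-ACT2, ExpMod-ACT4, ExpMod-variable, ExpMod-CRT, which is inferred from the order of the table cells; the function name and argument conventions are hypothetical.

```python
def scratchpad_words(function: str, m_len: int, lnme: bool, odd_powers: int = 0) -> int:
    """Scratchpad area size in 32-bit words, restating the Table C-66 formulas."""
    even = m_len + 2 - (m_len % 2)      # M_Len + 2 - (M_Len MOD 2)
    plain = m_len + 2                   # M_Len + 2
    if function in ("ExpMod-variable", "ExpMod-CRT") and not 1 <= odd_powers <= 16:
        raise ValueError("odd powers out-of-range")
    if function == "ExpMod-ACT2":
        return 3 * even + 10 if lnme else 5 * plain
    if function == "ExpMod-ACT4":
        return 9 * even if lnme else 11 * plain
    if function == "ExpMod-variable":
        if lnme:
            return max(3 * even + 10, (odd_powers + 1) * even)
        return (odd_powers + 3) * plain
    if function == "ExpMod-CRT":
        if lnme:
            return max(4 * even + 10, (odd_powers + 2) * even)
        return (odd_powers + 3) * plain + even
    raise ValueError(f"unknown ExpMod function: {function}")
```

As a consistency check on the assumed row pairing, ExpMod-variable with 2 odd powers gives the same sizes as ExpMod-ACT2, and with 8 odd powers it matches ExpMod-ACT4 for typical modulus lengths.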
Note: During execution of an ExpMod-ACT2, -ACT4 or -variable operation, the last 34 bytes of
the PKA RAM are used as general scratchpad for the Sequencer’s program execution. The
ExpMod-CRT operation requires the last 72 bytes of the PKA RAM as scratchpad. These
(fixed location) areas cannot overlap with any of the input vectors and/or the D vector
scratchpad area; they can be used freely when executing basic PKCP operations.
Table C-67. ExpMod Scratchpad Area / Input Vector Overlap Restrictions
Result Vector/Scratchpad Area Memory Allocation (both starting at PKA_DPTR)
Function
Result Vector
Restrictions
ExpMod-ACT2, ExpMod-ACT4, ExpMod-variable
D
Scratchpad area starting at D may not overlap with any of the
other vectors, except that Base C may be co-located with result
vector D to save space (i.e. PKA_CPTR = PKA_DPTR is
allowed).
ExpMod-CRT
D
Scratchpad area starting at D may not overlap with any of the
other vectors; this is also the location of the main Input vector
(with length 2 x B_Len).
C.5.6 PKI Key Decrypt Key Management Interface
The PKI Key Decrypt Key management interface is kept very simple as seen from the PKA side:
Four 16-word areas (all in on-chip secure RAM) are set aside to store Key Decrypt Keys, IV values, CTR mode increment values and the control bits for the AES core. Loading of the Key
Decrypt Keys is left up to the Host.
No functions are defined to actually manage the Key Decrypt Keys other than through direct
access from the Host bus.
During High Assurance mode boot-up, the locations in secure RAM used for KDK storage and
control are used to transfer PKA ‘farm’ engine firmware into the PKA. The whole KDK storage
area and control words are then zeroed after the firmware has been copied to its intended internal
locations.
Table C-68 shows the layout of the secure RAM, holding the parameter and control words for the
PKI Black Key Decrypt functionality. These locations are described in further detail in the following sub-sections. The secure RAM address space ranges from 0x10000 to 0x11FFF.
Table C-68. Secure RAM Layout for PKI Key Decrypt Key Storage
Byte Address (Within Secure RAM) | Control/Status Word Name | Description
0x0000 | PKI_KD_IV_0_0 | Initialization vector associated with PKI KDK number 0 (needed for non-ECB modes).
0x0004 | PKI_KD_IV_0_1 |
0x0008 | PKI_KD_IV_0_2 |
0x000C | PKI_KD_IV_0_3 |
0x0010 | PKI_KDK_0_0 | Actual PKI Key Decrypt Key number 0: least significant word always PKI_KDK_0_0; 128 bits key most significant word in PKI_KDK_0_3; 192 bits key most significant word in PKI_KDK_0_5; 256 bits key most significant word in PKI_KDK_0_7.
0x0014 | PKI_KDK_0_1 |
0x0018 | PKI_KDK_0_2 |
0x001C | PKI_KDK_0_3 |
0x0020 | PKI_KDK_0_4 |
0x0024 | PKI_KDK_0_5 |
0x0028 | PKI_KDK_0_6 |
0x002C | PKI_KDK_0_7 |
0x0030 | PKI_KDK_CONTROL_0 | Control word for PKI KDK number 0, also validity check word.
0x0034 | PKI_KD_INCR_0 | CTR mode increment value for PKI KDK number 0.
0x0038 - 0x003F | Reserved, write zero | -
0x0040 | PKI_KD_IV_1_0 | Initialization vector associated with PKI KDK number 1 (needed for non-ECB modes).
0x0044 | PKI_KD_IV_1_1 |
0x0048 | PKI_KD_IV_1_2 |
0x004C | PKI_KD_IV_1_3 |
0x0050 | PKI_KDK_1_0 | Actual PKI Key Decrypt Key number 1: least significant word always PKI_KDK_1_0; 128 bits key most significant word in PKI_KDK_1_3; 192 bits key most significant word in PKI_KDK_1_5; 256 bits key most significant word in PKI_KDK_1_7.
0x0054 | PKI_KDK_1_1 |
0x0058 | PKI_KDK_1_2 |
0x005C | PKI_KDK_1_3 |
0x0060 | PKI_KDK_1_4 |
0x0064 | PKI_KDK_1_5 |
0x0068 | PKI_KDK_1_6 |
0x006C | PKI_KDK_1_7 |
0x0070 | PKI_KDK_CONTROL_1 | Control word for PKI KDK number 1, also validity check word.
0x0074 | PKI_KD_INCR_1 | CTR mode increment value for PKI KDK number 1.
0x0078 - 0x007F | Reserved, write zero | -
0x0080 | PKI_KD_IV_2_0 | Initialization vector associated with PKI KDK number 2 (needed for non-ECB modes).
0x0084 | PKI_KD_IV_2_1 |
0x0088 | PKI_KD_IV_2_2 |
0x008C | PKI_KD_IV_2_3 |
0x0090 | PKI_KDK_2_0 | Actual PKI Key Decrypt Key number 2: least significant word always PKI_KDK_2_0; 128 bits key most significant word in PKI_KDK_2_3; 192 bits key most significant word in PKI_KDK_2_5; 256 bits key most significant word in PKI_KDK_2_7.
0x0094 | PKI_KDK_2_1 |
0x0098 | PKI_KDK_2_2 |
0x009C | PKI_KDK_2_3 |
0x00A0 | PKI_KDK_2_4 |
0x00A4 | PKI_KDK_2_5 |
0x00A8 | PKI_KDK_2_6 |
0x00AC | PKI_KDK_2_7 |
0x00B0 | PKI_KDK_CONTROL_2 | Control word for PKI KDK number 2, also validity check word.
0x00B4 | PKI_KD_INCR_2 | CTR mode increment value for PKI KDK number 2.
0x00B8 - 0x00BF | Reserved, write zero | -
0x00C0 | PKI_KD_IV_3_0 | Initialization vector associated with PKI KDK number 3 (needed for non-ECB modes).
0x00C4 | PKI_KD_IV_3_1 |
0x00C8 | PKI_KD_IV_3_2 |
0x00CC | PKI_KD_IV_3_3 |
0x00D0 | PKI_KDK_3_0 | Actual PKI Key Decrypt Key number 3: least significant word always PKI_KDK_3_0; 128 bits key most significant word in PKI_KDK_3_3; 192 bits key most significant word in PKI_KDK_3_5; 256 bits key most significant word in PKI_KDK_3_7.
0x00D4 | PKI_KDK_3_1 |
0x00D8 | PKI_KDK_3_2 |
0x00DC | PKI_KDK_3_3 |
0x00E0 | PKI_KDK_3_4 |
0x00E4 | PKI_KDK_3_5 |
0x00E8 | PKI_KDK_3_6 |
0x00EC | PKI_KDK_3_7 |
0x00F0 | PKI_KDK_CONTROL_3 | Control word for PKI KDK number 3, also validity check word.
0x00F4 | PKI_KD_INCR_3 | CTR mode increment value for PKI KDK number 3.
0x00F8 - 0x00FF | Reserved, write zero | -
0x0100 - 0x1FFF | Internal use, do not modify | Holds PKI command/result ring management, general PKA master controller scratchpad, command pre- and post-data areas.
The layout of the PKI_KD… words in secure RAM is chosen so that they can be transferred as one
block into the control registers of the PKA’s local AES core using the local DMA engine.
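Because the four KDK slots repeat every 16 words, the Host window address of any word follows from the slot number. The sketch below illustrates that arithmetic only; the function name and the dictionary keys are hypothetical, and the base address is taken from the 0x10000 to 0x11FFF secure RAM range given above.

```python
SECURE_RAM_BASE = 0x10000  # start of the secure RAM address range
SLOT_SIZE = 0x40           # 16 words of 4 bytes per KDK slot


def kdk_slot_addresses(slot):
    """Return the byte addresses of all words belonging to KDK slot 0..3."""
    if slot not in range(4):
        raise ValueError("KDK slot must be 0, 1, 2 or 3")
    base = SECURE_RAM_BASE + slot * SLOT_SIZE
    return {
        "iv": [base + 4 * i for i in range(4)],          # PKI_KD_IV_n_0..3
        "kdk": [base + 0x10 + 4 * i for i in range(8)],  # PKI_KDK_n_0..7
        "control": base + 0x30,                          # PKI_KDK_CONTROL_n
        "incr": base + 0x34,                             # PKI_KD_INCR_n
    }
```

For instance, `kdk_slot_addresses(1)["kdk"][0]` evaluates to 0x10050, the address of PKI_KDK_1_0.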
C.5.6.1 AES Byte Order Example
The following example is based on NIST Special Publication 800-38A, Appendix F (for additional
information about this publication, see the reference on “NIST Special Publication 800-38A”
on page 579). As the example shows, PKI_KDK, PKI_KD_IV and Input Data (the data to which
the pointers ‘A’...‘E’ in Table C-53 refer) need to be byte swapped within each 32-bit word
when copied into the local AES engine. After processing, the output data leaves the AES engine
as the decrypted ‘Input Data’. The red color indicates the first byte of each item.
Listing C-6. AES-CTR 128-Bit Decrypt.
NIST SP 800-38a:
AES Key In:   2b7e1516_28aed2a6_abf71588_09cf4f3c
IV/CTR In:    f0f1f2f3_f4f5f6f7_f8f9fafb_fcfdfeff
AES Data In:  874d6191_b620e326_1bef6864_990db6ce
AES Data Out: 6bc1bee2_2e409f96_e93d7e11_7393172a
KDK storage locations in secure RAM (input):
PKI_KDK_x_0[31:0]:   0x16157e2b
PKI_KDK_x_1[31:0]:   0xa6d2ae28
PKI_KDK_x_2[31:0]:   0x8815f7ab
PKI_KDK_x_3[31:0]:   0x3c4fcf09
PKI_KD_IV_x_0[31:0]: 0xf3f2f1f0
PKI_KD_IV_x_1[31:0]: 0xf7f6f5f4
PKI_KD_IV_x_2[31:0]: 0xfbfaf9f8
PKI_KD_IV_x_3[31:0]: 0xfffefdfc
PKI_KDK_CONTROL_x:   0xffdf0020 (AES-CTR 128-bit Decrypt)
PKI_KD_INCR_x[31:0]: 0x00000001
Encrypted Input Data (Local AES core input):
ENC_DATA_0[31:0]:    0x91614d87
ENC_DATA_1[31:0]:    0x26e320b6
ENC_DATA_2[31:0]:    0x6468ef1b
ENC_DATA_3[31:0]:    0xceb60d99
Decrypted Data (Local AES core output):
AES_DATA_IO_0[31:0]: 0xe2bec16b
AES_DATA_IO_1[31:0]: 0x969f402e
AES_DATA_IO_2[31:0]: 0x117e3de9
AES_DATA_IO_3[31:0]: 0x2a179373
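The byte swap in Listing C-6 amounts to reading each group of four bytes of the NIST byte string as a little-endian 32-bit word. The following sketch reproduces the listing's values on the Host side; the helper names are hypothetical and no PKA access is performed.

```python
def nist_hex_to_words(nist_hex):
    """Convert a NIST-style hex string (underscores allowed) to 32-bit words.

    Each group of four bytes is interpreted as little-endian, which is the
    byte swap shown in the listing above.
    """
    raw = bytes.fromhex(nist_hex.replace("_", ""))
    if len(raw) % 4:
        raise ValueError("input must be a whole number of 32-bit words")
    return [int.from_bytes(raw[i:i + 4], "little") for i in range(0, len(raw), 4)]


def words_to_nist_hex(words):
    """Inverse of nist_hex_to_words: recover the NIST byte string as hex."""
    return b"".join(w.to_bytes(4, "little") for w in words).hex()


key_words = nist_hex_to_words("2b7e1516_28aed2a6_abf71588_09cf4f3c")
iv_words = nist_hex_to_words("f0f1f2f3_f4f5f6f7_f8f9fafb_fcfdfeff")
```

Here `key_words[0]` is 0x16157e2b and `iv_words[3]` is 0xfffefdfc, matching PKI_KDK_x_0 and PKI_KD_IV_x_3 in the listing.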
C.5.6.2 PKI Key Decrypt Keys Storage (PKI_KDK_0_[0:7] … _3_[0:7])
The PKI Key Decrypt Keys storage uses four times 8 words at the start of the secure RAM. Keys
are stored with their least significant word first. A 128-bit key uses only the first 4 words of each
storage area, while a 192-bit key uses the first 6 words of each storage area. For byte order information, please refer to “AES Byte Order Example”.
Note: The PKI Key Decrypt Keys do not need to be pre-processed AES ‘decrypt’ keys; conversion
of normal AES keys stored here to AES decrypt keys is done automatically within the
local AES core.
PKI_KDK_0_[0:7]
PKI_KDK_0_[0:7] (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10010-0x1002F
Register layout: bits [31:0] = KDK_0; reset default: x for all bits.
Table C-69. PKI_KDK_0_[0:7] Bit Descriptions
Bits | Name | Type | Function
[31:0] | KDK_0 | R/W | Eight consecutive words holding PKI Key Decrypt Key number 0 (located in secure RAM).
PKI_KDK_1_[0:7]
PKI_KDK_1_[0:7] (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10050-0x1006F
Register layout: bits [31:0] = KDK_1; reset default: x for all bits.
Table C-70. PKI_KDK_1_[0:7] Bit Descriptions
Bits | Name | Type | Function
[31:0] | KDK_1 | R/W | Eight consecutive words holding PKI Key Decrypt Key number 1 (located in secure RAM).
PKI_KDK_2_[0:7]
PKI_KDK_2_[0:7] (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10090-0x100AF
Register layout: bits [31:0] = KDK_2; reset default: x for all bits.
Table C-71. PKI_KDK_2_[0:7] Bit Descriptions
Bits | Name | Type | Function
[31:0] | KDK_2 | R/W | Eight consecutive words holding PKI Key Decrypt Key number 2 (located in secure RAM).
PKI_KDK_3_[0:7]
PKI_KDK_3_[0:7] (Restricted Read/Write), 18-bit Address in Host Target Window: 0x100D0-0x100EF
Register layout: bits [31:0] = KDK_3; reset default: x for all bits.
Table C-72. PKI_KDK_3_[0:7] Bit Descriptions
Bits | Name | Type | Function
[31:0] | KDK_3 | R/W | Eight consecutive words holding PKI Key Decrypt Key number 3 (located in secure RAM).
C.5.6.3 PKI Key Decrypt IVs Storage (PKI_KD_IV_0_[0:3] … _3_[0:3])
The PKI Key Decrypt Initialization Vector (IV) storage uses four times 4 words at the start of the
secure RAM. IVs are stored with their least significant word first. For byte order information,
please refer to AES Byte Order Example.
PKI_KD_IV_0_[0:3]
PKI_KD_IV_0_[0:3] (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10000-0x1000F
Register layout: bits [31:0] = KD_IV_0; reset default: x for all bits.
Table C-73. PKI_KD_IV_0_[0:3] Bit Descriptions
Bits | Name | Type | Function
[31:0] | KD_IV_0 | R/W | Four consecutive words holding the Initialization Vector associated with PKI Key Decrypt Key number 0 (located in secure RAM).
PKI_KD_IV_1_[0:3]
PKI_KD_IV_1_[0:3] (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10040-0x1004F
Register layout: bits [31:0] = KD_IV_1; reset default: x for all bits.
Table C-74. PKI_KD_IV_1_[0:3] Bit Descriptions
Bits | Name | Type | Function
[31:0] | KD_IV_1 | R/W | Four consecutive words holding the Initialization Vector associated with PKI Key Decrypt Key number 1 (located in secure RAM).
PKI_KD_IV_2_[0:3]
PKI_KD_IV_2_[0:3] (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10080-0x1008F
Register layout: bits [31:0] = KD_IV_2; reset default: x for all bits.
Table C-75. PKI_KD_IV_2_[0:3] Bit Descriptions
Bits | Name | Type | Function
[31:0] | KD_IV_2 | R/W | Four consecutive words holding the Initialization Vector associated with PKI Key Decrypt Key number 2 (located in secure RAM).
PKI_KD_IV_3_[0:3]
PKI_KD_IV_3_[0:3] (Restricted Read/Write), 18-bit Address in Host Target Window: 0x100C0-0x100CF
Register layout: bits [31:0] = KD_IV_3; reset default: x for all bits.
Table C-76. PKI_KD_IV_3_[0:3] Bit Descriptions
Bits | Name | Type | Function
[31:0] | KD_IV_3 | R/W | Four consecutive words holding the Initialization Vector associated with PKI Key Decrypt Key number 3 (located in secure RAM).
C.5.6.4 PKI Key Decrypt CTR Mode Increment Storage (PKI_KD_INCR_0 … _3)
The PKI Key Decrypt Keys must be accompanied by a 32-bit increment value when CTR mode
decrypt is being used. These increment values are stored in secure RAM and are copied to the
local AES core’s AES_INC register when needed. Note that this is an internal register that is not
accessible by the Host. For byte order information, please refer to “AES Byte Order Example”
on page 361.
PKI_KD_INCR_0
PKI_KD_INCR_0 (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10034-0x10037
Register layout: bits [31:0] = KD_INCR_0; reset default: x for all bits.
Table C-77. PKI_KD_INCR_0 Bit Descriptions
Bits | Name | Type | Function
[31:0] | KD_INCR_0 | R/W | One word holding the CTR mode increment value associated with PKI Key Decrypt Key number 0 (located in secure RAM).
PKI_KD_INCR_1
PKI_KD_INCR_1 (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10074-0x10077
Register layout: bits [31:0] = KD_INCR_1; reset default: x for all bits.
Table C-78. PKI_KD_INCR_1 Bit Descriptions
Bits | Name | Type | Function
[31:0] | KD_INCR_1 | R/W | One word holding the CTR mode increment value associated with PKI Key Decrypt Key number 1 (located in secure RAM).
PKI_KD_INCR_2
PKI_KD_INCR_2 (Restricted Read/Write), 18-bit Address in Host Target Window: 0x100B4-0x100B7
Register layout: bits [31:0] = KD_INCR_2; reset default: x for all bits.
Table C-79. PKI_KD_INCR_2 Bit Descriptions
Bits | Name | Type | Function
[31:0] | KD_INCR_2 | R/W | One word holding the CTR mode increment value associated with PKI Key Decrypt Key number 2 (located in secure RAM).
PKI_KD_INCR_3
PKI_KD_INCR_3 (Restricted Read/Write), 18-bit Address in Host Target Window: 0x100F4-0x100F7
Register layout: bits [31:0] = KD_INCR_3; reset default: x for all bits.
Table C-80. PKI_KD_INCR_3 Bit Descriptions
Bits | Name | Type | Function
[31:0] | KD_INCR_3 | R/W | One word holding the CTR mode increment value associated with PKI Key Decrypt Key number 3 (located in secure RAM).
C.5.6.5 PKI Key Decrypt Key Control Words
Each of the four PKI Key Decrypt Keys (KDK) has a separate control word whose bits [15:0] are
transferred to the local AES core’s AES_MODE register. Note that this is an internal register that is
not accessible by the Host. A KDK is assumed to be valid when bits [15:10] are zero and bits [31:16]
are the bit-by-bit complement of bits [15:0].
PKI_KDK_CONTROL_x
PKI_KDK_CONTROL_0 (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10030-0x10033
PKI_KDK_CONTROL_1 (Restricted Read/Write), 18-bit Address in Host Target Window: 0x10070-0x10073
PKI_KDK_CONTROL_2 (Restricted Read/Write), 18-bit Address in Host Target Window: 0x100B0-0x100B3
PKI_KDK_CONTROL_3 (Restricted Read/Write), 18-bit Address in Host Target Window: 0x100F0-0x100F3
Register layout: [31:26] Must be ones, [25:16] bit-by-bit complement of bits [9:0], [15:10] Must be zeroes, [9:8] cfb_width, [7] cfb, [6] ofb, [5] ctr, [4] cbc, [3] ecb, [2:1] key_size, [0] encrypt; reset default: x for all bits.
Table C-81. PKI_KDK_CONTROL_x Bit Descriptions
Bits | Name | Type | Function
[31:26] | Must be ones | R/W | This field must be all ones to have this KDK considered as valid.
[25:16] | Bit-by-bit complement of bits [9:0] | R/W | This field should contain the bit-by-bit complement of cfb_width … encrypt bits [9:0] to have this KDK considered as valid.
[15:10] | Must be zeroes | R/W | This field must be all zeroes to have this KDK considered as valid.
[9:8] | cfb_width | R/W | Sets the number of bits fed back for the CFB operation. ‘00’ feeds back 128 bits, ‘01’ feeds back 1 (ONE) bit, ‘10’ feeds back 8 bits, ‘11’ is reserved, do not use.
[7] | cfb | R/W | Indicates Cipher Feed-Back mode operations are to be performed, mutually exclusive with the other mode selection bits (exactly one of these five must be set to ‘1’).
[6] | ofb | R/W | Indicates Output Feed-Back mode operations are to be performed, mutually exclusive with the other mode selection bits (exactly one of these five must be set to ‘1’).
[5] | ctr | R/W | Indicates Counter mode operations are to be performed, mutually exclusive with the other mode selection bits (exactly one of these five must be set to ‘1’).
[4] | cbc | R/W | Indicates Cipher Block Chaining mode operations are to be performed, mutually exclusive with the other mode selection bits (exactly one of these five must be set to ‘1’).
[3] | ecb | R/W | Indicates Electronic Code Book mode operations are to be performed, mutually exclusive [a] with the other mode selection bits (exactly one of these five must be set to ‘1’).
[2:1] | key_size | R/W | These two write only bits specify the key length to use. ‘00’ selects 128-bit keys, ‘01’ selects 192-bit keys, ‘10’ selects 256-bit keys, ‘11’ is reserved, do not use.
[0] | encrypt | R/W | Specifies encrypt (‘1’) or decrypt (‘0’) operation to be performed. Will normally be ‘0’.
a. Actually, there is a priority encoding of the five mode control bits done, but we advise NOT to use this feature.
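The validity rule (bits [15:10] zero, bits [31:16] the complement of bits [15:0]) makes a control word easy to build and check in Host software. The sketch below is an illustration only: the function names and lookup tables are hypothetical, while the bit positions and encodings are those of Table C-81.

```python
MODE_BITS = {"ecb": 3, "cbc": 4, "ctr": 5, "ofb": 6, "cfb": 7}
KEY_SIZE_CODES = {128: 0b00, 192: 0b01, 256: 0b10}
CFB_WIDTH_CODES = {128: 0b00, 1: 0b01, 8: 0b10}


def make_kdk_control(mode, key_bits, encrypt=False, cfb_width=128):
    """Build a PKI_KDK_CONTROL_x value with exactly one mode bit set."""
    low = (
        (CFB_WIDTH_CODES[cfb_width] << 8)   # cfb_width, bits [9:8]
        | (1 << MODE_BITS[mode])            # one of cfb/ofb/ctr/cbc/ecb
        | (KEY_SIZE_CODES[key_bits] << 1)   # key_size, bits [2:1]
        | int(bool(encrypt))                # encrypt, bit [0]
    )
    high = ~low & 0xFFFF                    # bit-by-bit complement of [15:0]
    return (high << 16) | low


def kdk_control_is_valid(word):
    """Apply the validity rule: [15:10] zero and [31:16] == ~[15:0]."""
    low = word & 0xFFFF
    high = (word >> 16) & 0xFFFF
    return (low >> 10) == 0 and high == (~low & 0xFFFF)
```

With these definitions, `make_kdk_control("ctr", 128)` yields 0xFFDF0020, the AES-CTR 128-bit decrypt value used in Listing C-6.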
C.5.7 PKI Engine Boot-Up and Internal Error Reporting
The PKA internal firmware uses the PKA_MASTER_SEQ_CTRL register accessible through the Host
interface (in non-High Assurance mode) for internal error reporting and ‘side-channel’ control.
Table C-82. PKA_MASTER_SEQ_CTRL Register Bit Descriptions
Bits | Name | Type | Function
[31] | Reset | R/W | Active HIGH software reset for the PKA master controller, enables access to PKA master controller program RAM when ‘1’.
[30:16] | RESERVED | -- | Reserved: write zeroes and ignore on read.
[15:8] | Status | R | This field conveys status information. Bit [8] is used to generate the PKA master interrupt (always set on an error), bit [15] indicates an actual error situation. Status values: 0x00: No error; 0x01: No error, used to trigger the Host software during bootup; 0x83: List full error; 0x85: Process sequence state error; 0x87: Invalid address error; 0x89: DMA error; 0x8B: Invalid use/setting; 0x8D: Invalid or no command; 0x8F: Invalid farm number; 0xFD: Function not available; 0xFF: Severe error (suspected deadlock). When an error is reported (bit [15] is HIGH), the buffer RAM word at Host window offset address 0x00074 (that is, the word following the control word PKA_RING_OPTIONS) holds a pointer into the firmware code that indicates the instruction where the error was detected; the firmware itself is halted.
[7] | SW_reset | Set-only | Set this bit HIGH to abort all operations in the PKA gracefully (that is, without breaking off ongoing Host transfers). Automatically reset LOW after handling.
[6:2] | RESERVED | -- | Reserved: write zeroes and ignore on read.
[1] | Reset_DMA | Set-only | Set this bit HIGH to reset the internal DMA channel accessing the Host interface. Use this after a DMA error has been reported. Automatically reset LOW after handling.
[0] | RESERVED | -- | Reserved: write zero and ignore on read.
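Host software that polls PKA_MASTER_SEQ_CTRL typically needs the Status field split out and mapped to text. The following is a minimal decoding sketch based only on the bit positions and status values in Table C-82; the function name and the returned dictionary keys are hypothetical.

```python
STATUS_TEXT = {
    0x00: "No error",
    0x01: "No error, used to trigger the Host software during bootup",
    0x83: "List full error",
    0x85: "Process sequence state error",
    0x87: "Invalid address error",
    0x89: "DMA error",
    0x8B: "Invalid use/setting",
    0x8D: "Invalid or no command",
    0x8F: "Invalid farm number",
    0xFD: "Function not available",
    0xFF: "Severe error (suspected deadlock)",
}


def decode_master_seq_ctrl(value):
    """Split a 32-bit PKA_MASTER_SEQ_CTRL value into its documented parts."""
    status = (value >> 8) & 0xFF
    return {
        "in_reset": bool(value & (1 << 31)),  # bit [31], Reset
        "status": status,                     # bits [15:8]
        "irq": bool(value & (1 << 8)),        # bit [8] drives the interrupt
        "error": bool(value & (1 << 15)),     # bit [15] marks a real error
        "text": STATUS_TEXT.get(status, "unknown status code"),
    }
```

A value of 0x00008900, for example, decodes to status 0x89 with both `irq` and `error` set and the text "DMA error".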
The boot-up sequence of the PKA requires three firmware images:
• Master boot image
• Farm engine execution image
• Master execution image
The sequence to boot up the PKA is as follows:
1. Load the master boot image into PKA_MASTER_PROG_RAM.
2. Load the farm engine execution image into PKA_BUFFER_RAM (non-High Assurance mode) or into PKA_SECURE_RAM (High Assurance mode).
3. Take the PKA master controller out of reset (clear bit [31] of the PKA_MASTER_SEQ_CTRL register). This starts distribution of the farm engine execution image to the farm engines and performs other preparatory steps.
4. Wait until the pka_master_irq becomes active (poll the AIC_RAW_STAT register or poll bit [8] of the PKA_MASTER_SEQ_CTRL register).
5. Verify that the PKA master controller set bits [15:8] of the PKA_MASTER_SEQ_CTRL register to value 0x01. If that is not the case, then the boot image program encountered an error.
6. Push the PKA master controller into reset (set bit [31] of the PKA_MASTER_SEQ_CTRL register).
7. Load the master execution image into PKA_MASTER_PROG_RAM.
8. Take the PKA master controller out of reset (clear bit [31] of the PKA_MASTER_SEQ_CTRL register). This starts the actual execution image.
9. Write ring configuration and control words at the start of the PKA_BUFFER_RAM and then write the PKA_RING_OPTIONS register. The PKA is now ready to receive the first command.
Note: The pka_mst_clk must be running during the whole boot up sequence.
After this sequence, the PKA is operational and it will be able to handle commands. Note that
when the first command descriptor is set up and executed, the PKA will automatically enter the
execution stage.
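Steps 1 through 8 of the boot-up sequence can be sketched as a short routine. Everything about the device interface here is an assumption made for illustration: the `dev` object with `load`, `read_ctrl` and `write_ctrl` methods, the `FakePka` stand-in, and the polling limit are hypothetical and do not correspond to any Tilera driver API. Step 9 (ring configuration and PKA_RING_OPTIONS) is left to the caller.

```python
RESET_BIT = 1 << 31  # bit [31] of PKA_MASTER_SEQ_CTRL


def boot_pka(dev, master_boot, farm_exec, master_exec,
             high_assurance=False, max_polls=1000):
    """Run boot steps 1-8 against a device object (hypothetical interface)."""
    dev.load("PKA_MASTER_PROG_RAM", master_boot)               # step 1
    farm_region = "PKA_SECURE_RAM" if high_assurance else "PKA_BUFFER_RAM"
    dev.load(farm_region, farm_exec)                           # step 2
    dev.write_ctrl(0)                                          # step 3: clear bit [31]
    for _ in range(max_polls):                                 # step 4: poll bit [8]
        ctrl = dev.read_ctrl()
        if ctrl & (1 << 8):
            break
    else:
        raise TimeoutError("pka_master_irq never became active")
    status = (ctrl >> 8) & 0xFF                                # step 5
    if status != 0x01:
        raise RuntimeError("boot image reported status 0x%02X" % status)
    dev.write_ctrl(RESET_BIT)                                  # step 6: set bit [31]
    dev.load("PKA_MASTER_PROG_RAM", master_exec)               # step 7
    dev.write_ctrl(0)                                          # step 8: clear bit [31]


class FakePka:
    """In-memory stand-in used only to exercise boot_pka; not real hardware."""

    def __init__(self, boot_status=0x01):
        self.ctrl = RESET_BIT
        self.boot_status = boot_status
        self.regions = {}
        self.log = []

    def load(self, region, image):
        self.regions[region] = bytes(image)
        self.log.append(("load", region))

    def read_ctrl(self):
        return self.ctrl

    def write_ctrl(self, value):
        self.log.append(("ctrl", value))
        # Leaving reset: pretend the firmware reports its status at once.
        self.ctrl = RESET_BIT if value & RESET_BIT else self.boot_status << 8


dev = FakePka()
boot_pka(dev, b"boot", b"farm", b"exec")
```

Passing the device in as an object keeps the sequence logic separate from however register and RAM access is actually performed on a given platform.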
C.6 Conventions Used in this Manual
C.6.1 Register Information
Registers within this document are shown as follows:
REGISTER_HEAD (Write Only), 18-bit Address in Host Target Window: 0x00000
[Register graphic: bit positions 31 down to 0, with a reset default of 0 shown for each bit.]
The register name, accessibility, and the Host address location for direct access are on the top lines.
The table shows all the register bit fields; a supporting description is included in the text below
the register table. Reserved fields are shaded gray. The bottom row in the register graphic shows
the power-up / reset default setting of the register for read.
APPENDIX D: INLINE PACKET ENGINE
D.1 Crypto Packet Processor Processing Overview
This appendix provides an overview of the Crypto Packet Processor and its function in the following sections:
• D.1 “Crypto Packet Processor Processing Overview” on page 371
• D.2 “Configuring the Crypto Packet Processor” on page 372
• D.3 “Pseudo Random Number Generator” on page 375
• D.4 “Input Token Definition” on page 379
• D.5 “Processing Instructions” on page 392
• D.6 “Result Token Definition” on page 429
• D.7 “Pre and Post-Processing by Host Software” on page 432
• D.8 “Context Record Definition” on page 433
• D.9 “Register and Memory Map” on page 456
• D.10 “Protocol Compliancy” on page 481
D.1.1 Crypto Packet Processor Terms
This appendix makes frequent use of the terms token and context (see www.iana.org/assignments/protocol-numbers). These terms refer to data structures that the Crypto Packet
Processor uses to perform packet processing operations. In IPSec, a context is a (packet independent) data structure that contains key material and processing parameters associated with the
processing of packets that have been recognized (classified) to be sent through a specific
IPSec tunnel. In IPSec terminology, the Security Association (SA) structure that defines an
IPSec tunnel is represented as the context in the Crypto Packet Processor. Because the Crypto
Packet Processor can support more protocols than just IPSec, the Crypto Packet Processor context contains more parameters than just the IPSec SA parameters.
A context is often said to describe a packet transform. The term transform stems from the definition: if a packet is processed according to the rules specified in the context, the packet is said to have been transformed
by the Crypto Packet Processor. Alternate terms for the transformation of packets in this
way are tunneling or encapsulation, which specifically refer to packet transformations requiring the
encryption of plaintext data, and detunneling or decapsulation, which refer to the reverse operation.
A context contains data that is not packet specific; multiple packets can be processed (transformed) under the same context.
The term token refers to a data structure that is created for each (individual) packet. A token is a
specific data structure containing processing commands and instructions that the Crypto
Packet Processor uses to process one specific packet. Among other things, a token contains a number of parameters extracted from the packet itself (for example, packet or packet header
length, specific packet header fields, or offsets to such fields in the packet stream), that is, data
that can change with each packet. A Crypto Packet Processor token also contains processing
instructions used to control the detailed packet processing operation performed by the Crypto
Packet Processor.
D.1.1.1 Tokens
The input token is provided as part of the operation’s Extra Data (see “Operand Data Specification” on page 181). The token provided to the Crypto Packet Processor must have at least five
words: four words that are always the first four words of an input token and one instruction word.
The first four words contain pointers, options and packet length. The next token words can
contain instructions or data. The length of the input token is limited by the maximum size allowed
in Extra Data. The order and sequence of instructions and data in the input token is restricted and
is explained in “Input Token Definition” on page 379. The result token can contain four to eight
data words and contains the result-packet length, error flags, and packet result information. See
“Result Token Definition” on page 429.
D.1.1.2 Context
The context data is provided as part of the operation’s Extra Data (see “Operand Data Specification” on page 181). The context data contains processing parameters and keying data. The first two
words of each context contain the options and information about the available fields in the context.
The sequence of the different fields within a context is fixed, but not all fields need to be available.
The length of a field is variable and depends on the selected options and algorithms in the first
two context words. For optimal memory usage, all available fields are concatenated to each other.
D.2 Configuring the Crypto Packet Processor
Before the Crypto Packet Processor can be used for packet processing, the Host must initialize its
configuration registers. The configuration registers are accessible in the Engine Global Address
Space. This section will discuss some general principles and methods for initializing and using the
Crypto Packet Processor efficiently. General concepts are presented here; details on the registers
used for configuration can be found in D.9 “Register and Memory Map” on page 456.
D.2.1 Enabling Protocol and Algorithm Support
First, you must decide if you want protocol and algorithm support to be enabled. By default, all
implemented cryptographic protocols and algorithms in the Crypto Packet Processor are enabled.
Individual algorithms, as well as protocol support, can be disabled by using the Protocol/Algorithm Enable register (see “Protocol/Algorithm Enable Register” on page 466). Enabling
protocol support allows software on the Host system to determine the hardware capabilities. If
an algorithm is disabled and the software tries to use that algorithm, an error is generated.
D.2.2 Context Fetch Modes
The second decision relates to the context fetching. The Crypto Packet Processor supports two
modes for context fetching:
1. Address mode. In Address mode, the context fields are fetched from the Context Record
Address Map shown in “Context Record Format” on page 433, always starting from Context
Control Word 0. Irrelevant fields can be filled with dummy values. The number of context
words fetched is indicated by the Context Size (in dwords) located in the Context Control
Register (see “Context Control Register” on page 466). This fetch mode can be used when the
size of the context is fixed, that is, independent of the algorithm and mode used.
2. Control mode. In Control mode, the context fields are fetched from a customized Context
Record, based on the control bits in the context control words. Control mode optimizes the
context fetch by supporting a customized context record layout containing only relevant fields
(abutted to each other) for the requested operation.
When bit C in the Input Token Header is set (see “C: Context Control Words Present in
Token” on page 381), the fetch does not include the Context Control Words 0/1, since these
were fetched with the token. The fetch begins after Context Control Word 1 (at starting
address 0x02) with a length defined by the context length field in Context Control Word 0.
When bit C in the Input Token Header is not set, the fetch begins with Context Control Word 0
and the number of fetched context words is indicated by the Context Size (in dwords) located in
the Context Control Register, similar to Address mode.
This fetch mode allows for using an optimal context record size and reducing the overhead
caused by fetching unused fields.
Table D-1 outlines how the selection of the Context Fetch Mode is arbitrated between settings
of the Context Control Register, bits [9:8] and the Context Control Word 1, bit [31].
Table D-1. Context Fetch Control
Context Control Register, Bits [9:8] | Context Control Word 1, Bit [31] | Context Fetch Mode
00 | x | address mode (default)
01 | x | address mode
10 | x | control mode
11 | 0 | control mode
11 | 1 | address mode
See also “Context Control Register” on page 466 and “Context Record Definition” on page 433.
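The arbitration in Table D-1 reduces to a small decision function. The sketch below is illustrative only; the function name and the returned strings are hypothetical, and the two arguments are assumed to have already been extracted from the Context Control Register and Context Control Word 1.

```python
def context_fetch_mode(ctrl_reg_bits_9_8, ccw1_bit_31):
    """Return "address" or "control" following the rows of Table D-1.

    ctrl_reg_bits_9_8 -- value 0..3 of Context Control Register bits [9:8]
    ccw1_bit_31       -- value 0 or 1 of Context Control Word 1 bit [31]
    """
    if ctrl_reg_bits_9_8 in (0b00, 0b01):
        return "address"              # bit [31] is a don't-care
    if ctrl_reg_bits_9_8 == 0b10:
        return "control"              # bit [31] is a don't-care
    if ctrl_reg_bits_9_8 == 0b11:
        return "address" if ccw1_bit_31 else "control"
    raise ValueError("bits [9:8] must be in the range 0..3")
```

Only the `11` setting lets the per-context bit [31] choose the mode; the other three settings fix the mode for all contexts.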
D.2.3 Packet Processing Modes
The packet-processing module contains all crypto and hashing blocks including a programmable
interconnect mechanism for these blocks, which allows for various processing modes.
The packet-processing mode is controlled by the Type of Packet field in the context control
word 0 (see “Control Word 0 Field Encoding” on page 436). All possible configurations of the datapath are shown in the figures below.
[Figure: block diagram of the packet processor showing the input, direction crossbar, crypto, hash, and output blocks, with two paths marked “Blocked”.]
Figure D-1: Packet Processor with ToP=4’b000x
[Figure: two block diagrams of the packet processor showing the input, direction crossbar, MUX, crypto, hash, context, digest output, and output blocks.]
Figure D-2: Packet Processor with ToP=4’b001x (left) and with ToP=4’b010x (right)
[Figure: two block diagrams of the packet processor showing the input, direction crossbar, MUX, crypto, hash, context, digest output, and output blocks, with paths labeled “hash only”, “encrypt-then-hash”, and “decrypt-then-hash”.]
Figure D-3: Packet Processor with ToP=4’b011x (left) and with ToP=4’b111x (right)
Note: Because the system is so complex, these figures represent the functional behavior of the
system and not the actual physical implementation.
As shown in the figures above, the input data can be directed to three possible destinations
within the packet processor (crypto, hash, and/or output), as defined by the
Type of Destination (ToD) field of the token instruction. Output of the packet processor
is always passed to the post-processor module. The table below describes the relation between
the Type of Destination field of the instruction and the Type of Packet (ToP) field of context
control word 0.
In summary, the ToD field determines where the input data should be made available in the
packet processor (the crypto block and/or the hash block and/or passed to the output). The
ToP field in the context configures the datapath within the packet processor itself, enabling
crypto and hash blocks when needed and determining the order of these operations.
Table D-2. Relation between ‘Type of Destination’ and ‘Type of Packet’ Fields
‘Type of Destination’ Field   ‘Type of Packet’
Crypto   Hash   Output        Field              Operation
0        0      1             xxxx               Pass data to the output.
0        1      0             xx1x               Pass data to the hash engine.
                              xx0x               Remove data.
0        1      1             xx1x               Pass data to the hash engine and also to the output.
                              xx0x               Pass data to the output.
1        0      0             x1xx               Pass data to the crypto engine;
                                                 encrypted/decrypted data are ignored.
                              x0xx               Remove data.
1        0      1             x1xx               Pass data to the crypto engine and after
                                                 encryption/decryption pass to the output.
                              x0xx               Pass data to the output.
1        1      0             000x               Remove data.
                              001x               Pass data only to the hash engine.
                              x10x               Pass data to the crypto engine;
                                                 encrypted/decrypted data are ignored.
                              011x               Pass data to the crypto engine;
                                                 encrypted/decrypted data are passed to the hash engine.
                              111x               Pass data to the crypto engine and pass the same data
                                                 to the hash engine;
                                                 encrypted/decrypted data are ignored.
1        1      1             000x               Pass data to the output.
                              001x               Pass data to the output and at the same time to the hash.
                              x10x               Pass data to the crypto engine and after
                                                 encryption/decryption pass to the output.
                              011x               Pass data to the crypto engine;
                                                 encrypted/decrypted data are passed to the hash engine
                                                 and to the output.
                              111x               Pass data to the crypto engine and pass the same data
                                                 to the hash engine;
                                                 encrypted/decrypted data are passed to the output.
Note: x = Don’t care.
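The relation between the two fields can be modeled as a pattern lookup. The Python sketch below transcribes the rows of the table above into a dictionary keyed by the ToD bits (Crypto, Hash, Output); the names ROUTING, matches, and route, and the short operation strings, are assumptions made for this illustration. Combinations that the table does not list return None rather than a guessed behavior.

```python
# Rows transcribed from the ToD/ToP table; 'x' marks a don't-care bit.
ROUTING = {
    (0, 0, 1): [("xxxx", "output")],
    (0, 1, 0): [("xx1x", "hash"), ("xx0x", "remove")],
    (0, 1, 1): [("xx1x", "hash + output"), ("xx0x", "output")],
    (1, 0, 0): [("x1xx", "crypto, result ignored"), ("x0xx", "remove")],
    (1, 0, 1): [("x1xx", "crypto -> output"), ("x0xx", "output")],
    (1, 1, 0): [("000x", "remove"),
                ("001x", "hash"),
                ("x10x", "crypto, result ignored"),
                ("011x", "crypto -> hash"),
                ("111x", "crypto and hash on same data, crypto result ignored")],
    (1, 1, 1): [("000x", "output"),
                ("001x", "hash + output"),
                ("x10x", "crypto -> output"),
                ("011x", "crypto -> hash + output"),
                ("111x", "crypto and hash on same data, crypto result -> output")],
}


def matches(pattern: str, top: int) -> bool:
    """True if the 4-bit ToP value fits a pattern such as '01 1x'."""
    bits = format(top, "04b")
    return all(p == "x" or p == b for p, b in zip(pattern, bits))


def route(crypto: int, hash_: int, output: int, top: int):
    """Return the operation for a ToD triple and ToP value, or None if unlisted."""
    if not 0 <= top <= 0xF:
        raise ValueError("ToP is a 4-bit field")
    for pattern, operation in ROUTING.get((crypto, hash_, output), []):
        if matches(pattern, top):
            return operation
    return None
```

For example, route(1, 1, 1, 0b0110) selects the 011x row, in which the crypto result feeds both the hash engine and the output.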
D.3 Pseudo Random Number Generator
This section describes the Pseudo Random Number Generator (PRNG), its purpose, architecture,
and function.
D.3.1
Purpose
The Crypto Packet Processor includes an ANSI X9.17 compliant Pseudo Random Number Generator
(PRNG) that provides pseudo-random data for generating keys, Initialization Vectors (IVs), etc.
It provides up to 16 bytes of data at a time, enabling the generation of random IVs for Data
Encryption Standard (DES), Triple-DES and Advanced Encryption Standard (AES) on a per
packet basis without slowing down the system. The DES block within the PRNG uses a 64-bit
LFSR as input data.
Unlike true random number generators, which exploit the randomness that occurs in some physical phenomena, pseudo random number generators are devices or algorithms that output
statistically independent and unbiased numbers. In general, a PRNG is a deterministic algorithm
that for a given truly random binary sequence (provided as a combination of the seed and key registers), outputs a binary sequence that “appears” to be random.
D.3.2
Architecture
The PRNG module architecture diagram is shown in Figure D-4.
[Figure: PRNG block diagram. The Processor Interface exposes the registers seed_l/seed_h, key0_l/key0_h, key1_l/key1_h, res0_l/res0_h, res1_l/res1_h, lfsr_l/lfsr_h, ctrl, and stat. PRNG Control (enable, auto) and the PRNG Counter drive the PRNG Datapath (seed, lfsr, result0/1, XOR logic). The Triple-DES block contains DES Control (start_des, des_rdy) and the DES Datapath (key0/1, result).]
Figure D-4: Pseudo Random Number Generator Architecture Diagram
The Control logic contains a state machine that controls the pseudo random number generation
process. The PRNG module is configured for an operation through a set of registers in the Engine
Global Register Space. To determine the status of the PRNG module, read the PRNG_STAT
register.
D.3.3
Functional Description
The operation of the Crypto Packet Processor internal PRNG is based on the “ANSI X9.17, Annex
C example pseudo-random key and IV generation algorithm”. The PRNG can generate 64-bit or
128-bit pseudo-random numbers. A 128-bit result is generated by running the operation for 64-bit
numbers twice, returning the results in the PRNG_RES0 and PRNG_RES1 registers. When a 64-bit
result is requested, only PRNG_RES0 is used. A 64-bit result is generated as follows:
    I = ede*K(DT)
    R = ede*K(I ⊕ V)
A new V is then generated by:
    V = ede*K(R ⊕ I)
Here, ede means encryption-decryption-encryption, one form of a Triple-DES operation (see
Figure D-5). A ciphertext C is calculated from a plaintext P using the following formula:
    C = E_K0[ D_K1[ E_K0[ P ] ] ]
The key pair K0 and K1 is reserved only for the generation of keys, so they should not be the same
as any previously known values.
[Figure: plaintext P passes through E (key K0), D (key K1), and E (key K0) to produce ciphertext C.]
Figure D-5: Multiple Encryption with Triple DES Using Two Keys
The algorithm consists of three Triple-DES operations that use the same key pair *K. A schematic
overview of the algorithm is provided below in Figure D-6.
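[Figure: Key0/1 supplies K0 and K1 to all three Triple-DES stages. The LFSR supplies DT to TripleDES 1, which produces I. I is XORed with the Seed (V) and fed to TripleDES 2, which produces R (the Result). R is XORed with I and fed to TripleDES 3, which produces the new V.]
Figure D-6: Schematic Overview of the Pseudo-Random Algorithm Specified by ANSI X9.17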
Input to the first Triple-DES operation is a secret value DT (DT stands for date/time). Since DT
must be updated on each pseudo-random number generation, it is the output of an LFSR (see
“Generation of DT” on page 377).
Output of the first Triple-DES operation is the intermediate value I, which is stored for later use.
Input to the second Triple-DES operation is the exclusive-or operation of I with a secret seed
value V, which can be an arbitrary number.
Output of the second Triple-DES operation is the vector R, which is the pseudo-random
number returned as the result.
Input to the third Triple-DES operation is the exclusive-or of the intermediate value I
with R. Output of the third Triple-DES operation is an updated seed value V, which is stored and used
as input to the second Triple-DES operation of the next number generation.
All numbers are 64 bits wide, except for the key pair, which consists of two 56-bit keys.
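The data flow of the three Triple-DES operations can be sketched in a few lines of Python. This is a structural illustration only: the block cipher is passed in as a callable, and toy_ede below is a deliberately trivial stand-in (a fixed XOR), not Triple-DES and not secure. The function names, and the choice to pass two separate DT values for the 128-bit case, are assumptions made for this example.

```python
from typing import Callable, Tuple

Block = int  # a 64-bit value held in a Python int


def x917_generate64(ede: Callable[[Block], Block],
                    dt: Block, v: Block) -> Tuple[Block, Block]:
    """One generation step: returns (R, new V)."""
    i = ede(dt)          # I = ede*K(DT)
    r = ede(i ^ v)       # R = ede*K(I xor V)
    v_next = ede(r ^ i)  # V = ede*K(R xor I)
    return r, v_next


def x917_generate128(ede: Callable[[Block], Block],
                     dt0: Block, dt1: Block,
                     v: Block) -> Tuple[Block, Block, Block]:
    """Run the 64-bit step twice; returns (RES0, RES1, new V)."""
    r0, v = x917_generate64(ede, dt0, v)
    r1, v = x917_generate64(ede, dt1, v)
    return r0, r1, v


def toy_ede(block: Block) -> Block:
    """Placeholder for the keyed Triple-DES ede operation (NOT a real cipher)."""
    return block ^ 0x0123456789ABCDEF
```

A real implementation would replace toy_ede with a two-key Triple-DES ede operation keyed by K0 and K1; the seed update ensures that each call starts from a different V even if DT were to repeat.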
D.3.4
Generation of DT
The plaintext input DT is generated using the output of a 64-bit LFSR. The LFSR is based on the
primitive polynomial f(x) = x^64 + x^63 + x^61 + x^60 + 1 (see Figure D-7). After reset, the Host must
seed the LFSR through the APB slave interface using two write transfers. The seed for the LFSR
can be any value except all zeroes.
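The following Python sketch shows one common way to step an LFSR for this polynomial and why an all-zero seed is unusable. It uses a right-shifting Galois form with the conventional bit numbering (the term x^n maps to bit n-1); the actual bit ordering and register arrangement of the hardware are not specified here, so both are assumptions, as are the names galois_mask, lfsr_step, and DT_MASK.

```python
def galois_mask(exponents) -> int:
    """Build a right-shift Galois feedback mask; the term x^n maps to bit n-1."""
    mask = 0
    for n in exponents:
        mask |= 1 << (n - 1)
    return mask


# f(x) = x^64 + x^63 + x^61 + x^60 + 1 (the constant term is implicit)
DT_MASK = galois_mask((64, 63, 61, 60))


def lfsr_step(state: int, mask: int) -> int:
    """Advance the LFSR by one bit (Galois form, shifting right)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= mask  # apply feedback only when a 1 is shifted out
    return state
```

With state 0 no feedback is ever applied, so the register stays at zero forever, which is why the seed must be non-zero.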
[Figure: chain of clocked D/Q registers numbered 0, 1, ..., 59 through 64, with XOR gates inserted in the chain near stages 60, 61, and 63; output lfsr[63:0].]
Figure D-7: Diagram of 64-Bit LFSR to Generate Parameter DT
The parameter DT could easily be the output of a counter or any other circuitry that produces a
unique number. For the Crypto Packet Processor an LFSR is used because of the low gate count
and improved timing.
D.3.5
Generation of Keys
The key pair, Key0 and Key1, is implemented using two 56-bit LFSRs, one for each key. Each
LFSR is based on the primitive polynomial f(x) = x^56 + x^54 + x^52 + x^49 + 1 (see Figure D-8). After
reset, the Host must seed each LFSR using two write transfers. The seed for each LFSR can be any
value except all zeroes.
[Figure: chain of clocked D/Q registers numbered 0, 1, ..., 48 through 56, with XOR gates inserted in the chain near stages 49, 52, and 54.]
Figure D-8: Diagram of 56-bit LFSR to Generate Key Pair (Key0 and Key1)
The keys could just as well be the output of a counter or any other circuitry that produces a
unique number. For the Crypto Packet Processor, an LFSR is used because of the low gate count
and improved timing.
D.3.6
Performance
The PRNG can produce a new 64-bit pseudo random number every 150 system clock
cycles, or a 128-bit pseudo random number every 300 system clock cycles. See Table D-3.
Table D-3. PRNG Performance
                                          Number of        Random Bit Rate
                                          Clock Cycles     at 150 MHz
64-bit Pseudo Random number word rate     150              64 Mbits/sec
128-bit Pseudo Random number word rate    300              64 Mbits/sec
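The bit rates in the table follow directly from the cycle counts: bits per number multiplied by the clock frequency, divided by the cycles per number. The short Python check below reproduces the arithmetic; the function name is hypothetical.

```python
def prng_bit_rate(bits_per_number: int, cycles_per_number: int,
                  clock_hz: float) -> float:
    """Random bit rate in bits per second."""
    return bits_per_number * clock_hz / cycles_per_number


rate_64 = prng_bit_rate(64, 150, 150e6)    # 64-bit numbers at 150 MHz
rate_128 = prng_bit_rate(128, 300, 150e6)  # 128-bit numbers at 150 MHz
```

Both cases give the same rate because the 128-bit result is simply two 64-bit generations run back to back. The rate scales linearly with the system clock.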
Note: When the Crypto Packet Processor is using AES, the PRNG should generate 128-bit
numbers and, hence, bit [2] of the PRNG Control (PRNG_CTRL) register must be set to 1 in
order to get sufficiently strong random data. For (3)DES, this bit can be set to 0 but in a
typical case where both DES and AES are used, it is recommended that you set this bit to 1
and leave it at this value. Refer to “PRNG Control Register (PRNG_CTRL)” on page 475.
D.4 Input Token Definition
D.4.1
Introduction
The Crypto Packet Processor Inline Packet Engine processes packets using the instructions from
an input token. The token, read via a dedicated interface, consists of a header and a set of instructions (commands). Information in the token header initiates and controls the packet data and
context fetching. The instructions that follow the token header control the packet processing
itself, a process that is started when both context and packet data are available. Bypass data can
be located at the end of the token after the processing instructions, where it will be passed
to the result token without modification.
A token header must always contain a fixed set of four dwords. These header fields contain general information that is required for every packet such as pointers to the packet data and context
and packet length.
The token has a minimum size of five dwords. Although the token length is unlimited, a token
is typically between 10 and 30 dwords long.
D.4.2
Input Token Diagram
Each row below is one or more 32-bit dwords (bit 31 on the left, bit 00 on the right).
dword [0]   - | - | U | IV | C | ToO | RC | (CT) | - | input packet length
dword [1]   input packet pointer
dword [2]   output packet pointer
dword [3]   transform record pointer / context pointer
            context control 0 (optional)
            context control 1 (optional)
            IV0..1..2..3 (optional)
            reserved | checksum (optional)
            processing instructions
            bypass token data (optional)
Note: All bits marked with dashes (-) are reserved and should be set to 0.
D.4.2.1 Input Token Header
The Input Token Header consists of the initial four (required) dwords plus the optional dwords
that may include the context control words [1:0], the IV [3:0], and a 16-bit checksum.
Token Control Word (token dword [0], Required)
This section describes the fields of the Token Control Word, the first dword of the input token.
Table D-4. Token Control Word
Name / Description

Input Packet Length
    This field must equal the number of packet bytes that need to be fetched and processed
    by the Crypto Packet Processor.

CT – Context Type (Reserved)
    Only relevant when parallel operating engines are employed in combination with system
    level SA management; set to ‘00’ when only one engine is employed.
    For systems containing multiple instances of the engine, it is possible for one context to
    be used by several engines simultaneously. If in this case the use of the shared context
    also requires it to be updated after the packet processing operation, then the context
    record needs to be protected from use by other engines, in particular the part of the
    context that may be changed as a result of such an update.
    Note: The Crypto Packet Processor does not support the Context Type field. Therefore the
    value of these (CT) bits is reserved and should be set to ‘00’.

RC – Reuse Context
    The RC field is not used and must always be ‘00’. Note that this field is also referred to
    as Context Reuse.

ToO – Type of Output
    These bits must be set to reflect the behavior of any post-processing instructions that can
    be present in the token. These bits control the moment at which the Crypto Packet
    Processor allows the packet data to be read from the output buffer: if one of the
    postprocessing instructions requires an update to the header of the packet (as would be the
    case for Authentication Header (AH) operations), then the packet must remain in the
    packet buffer until the complete packet has been processed. If the packet is larger than
    the Crypto Packet Processor internal packet buffer, then the update to the header is
    appended to the end of the packet data. In this case, the result token signals the type and
    amount of update data via the packet info fields, bits [31:22] of the 2nd output (result)
    token word. Note that the packet length field in the result token does not include this
    additional update data.
    If the token contains postprocessing instruction(s) requiring packet header updates, then
    one of the header update options must be set. One of the header update options may be
    used even if no header update instructions are present; this causes the Crypto
    Packet Processor to hold the packet in its internal buffer until it is completely processed.
    This allows the ToO bits to be set independently of the packet length at the expense of a
    slight performance hit.
    Note that when the Crypto Packet Processor is instructed to perform the header update
    in its internal RAM, but the packet is too large to fit into the Crypto Packet Processor
    output buffer, then the Crypto Packet Processor will still append the result to the packet.

ToO[1:0]
    00  No header update.
        This setting is normally used if no header update instructions are present in the token.
        Note: If header update instructions are still present in the token, they are
        discarded by the Crypto Packet Processor and no header updates will take place.
    01  Header update – small packets only.
        This setting causes the Crypto Packet Processor to hold the result packet
        data in order to update the packet header immediately. If the packet is too
        large to fit into the Crypto Packet Processor internal result packet buffer, the
        Crypto Packet Processor starts writing out the packet data after the result packet
        has exceeded the size of 1792 bytes, and the Crypto Packet Processor reverts
        to appending the result data at the end of the packet data.
    10  Append result to packet.
        This setting prevents the Crypto Packet Processor from updating the packet header
        in its internal packet buffer. Writing out of the packet data can start as soon as
        enough data is available. The update data is appended to the end of the packet.
    11  Header update and result appending.
        This setting is required if the token contains instructions for both a header update
        and result data to be appended to the packet. The Crypto Packet Processor will keep
        the packet data in its internal packet buffer (under the same conditions as listed for
        option ‘01’) until the ‘STAT’ bits of one of the postprocessing instructions
        (see "Instruction Format") indicate that the last header update instruction has been
        processed. Update data from any postprocessing instruction following the ‘last header
        update’ postprocessing instruction will be appended to the packet. Packet data output
        will commence after the ‘last header update’ bits have been seen. There is currently
        no practical use case for this setting.

ToO[2]
    Bit 2 of the Type of Output field is used to indicate pad removal options. This bit can only
    be set for inbound operations. If the bit is set, the padding type in the context record must
    be one of the following: PKCS#7, RTP, IPSec, TLS, or SSL.
    0--  no pad removal
    1--  remove and (optionally) verify pad
    Accidental use of this bit for outbound operations can result in unwanted pad removal or,
    if no padding is found, an error situation.
    The ToO bits affect the operation of the INSERT_REMOVE_RESULT and REPLACE_BYTE
    instructions. Refer to “Post-Process Instructions” on page 408.

C: Context Control Words Present in Token
    Setting this bit forces the Crypto Packet Processor to read the context control words
    (normally, the first two words of the context record) from the token (context control0 and
    context control1).
    The context control words in the token are formatted identically to the command words in
    the context record (see “Context Record Format” on page 433).

IV: Usage and Selection
The IV Register Usage table below shows typical examples of how the IV registers can be used for five types of crypto operations: DES/3DES-CBC, AES-CTR, AES-ICM and AES-CBC. Note that in the Crypto Packet
Processor, IV fields (IV0 through IV3) can be taken from four possible sources:
1. Context Record
2. PRNG
3. Input Token
4. Input Packet – indirectly via Input Token processing instruction
Table D-5. IV Register Usage
Encryption      IV Sources                                                                              Token Control  Context Control-1  Context Control-1      Context Control-1
Algorithm                                                                                               [28:26]        [2:0]              [8:5]                  [11:10]
(Mode)          IV0                     IV1              IV2              IV3                           IV[2:0]        Crypto Mode [2:0]  IV3/IV2/IV1/IV0        IV format
DES/3DES-CBC    Input packet            Input packet     not used         not used                      000            001                0000                   00
                PRNG                    PRNG             not used         not used                      001            001                0000                   00
                Context record          Context record   not used         not used                      000            001                0011                   00
AES-CTR         Input packet            Input packet     Input packet     Input packet                  000            110                0000                   00
                Input packet            Input packet     Input packet     32’h00000001                  000            010                0000                   00
                PRNG                    PRNG             PRNG             PRNG                          001            110                0000                   00
                (nonce) Context record  Input packet*    Input packet*    32’h00000001                  000            010                0001                   01
                (nonce) Context record  PRNG             PRNG             32’h00000001                  001            010                0001                   01
                (nonce) Context record  sequence number  sequence number  32’h00000001                  000            010                0001                   10/11
                (nonce) Context record  Context record   Context record   Context record                000            110                1111                   00
                (nonce) Context record  Context record   Context record   32’h00000001                  000            010                0111                   01
                (nonce) Input token     Input packet     Input packet     32’h00000001                  100            010                0000                   01
                (nonce) Input token     PRNG             PRNG             32’h00000001                  101            010                0000                   01
                (nonce) Input token     sequence number  sequence number  32’h00000001                  100            010                0000                   10/11
                (nonce) Input token     Context record   Context record   Input token                   100            110                1110                   00
                (nonce) Input token     Context record   Context record   32’h00000001                  100            010                0110                   01
AES-ICM         Input packet            Input packet     Input packet     Input packet                  000            111                0000                   00
                Input packet            Input packet     Input packet     Input packet with 16’h0000    000            011                0000                   00
                PRNG                    PRNG             PRNG             PRNG                          001            111                0000                   00
                PRNG                    PRNG             PRNG             PRNG with 16’h0000            001            011                0000                   00
                Context record          Context record   Context record   Context record                000            111                1111                   00
                Context record          Context record   Context record   Context record with 16’h0000  000            011                1111                   00
AES-CBC         Input packet            Input packet     Input packet     Input packet                  000            001                0000                   00
                PRNG                    PRNG             PRNG             PRNG                          001            001                0000                   00
                Context record          Context record   Context record   Context record                000            001                1111                   00
Notes:
The IV words loaded from the Input Packet are retrieved from the input data stream with a RETRIEVE instruction; please refer
to D.5.4.7 “RETRIEVE Instruction” on page 403 for more details.
IVs sourced from the input packet are typically used for inbound (decrypt) type operations, where the IVs are sent along with the packet data (for example, ESP decapsulation). Internally generated IVs are typically used for outbound (encrypt) operations.
AES-CTR is the underlying crypto algorithm for AES-GCM and AES-CCM. Therefore, the same IV loading mechanism applies.
Selecting IV Source
The Crypto Packet Processor engine’s internal IV register (128 bits) is initially written using the
IV fields (IV0 through IV3) of the context record during a context fetch operation. Each IV field
can then be modified (overwritten) before the internal IV register is actually used with encrypt or
decrypt operations. The IV field modification options are described in this section.
In the Crypto Packet Processor, selecting IV field modification options is performed by using the
following bits:
1. Three IV bits in the Input Token Control Word, bits [28:26].
2. Two IV format bits of the Context Control Word 1, bits [11:10].
The IV bits of the input token control word provide the options identified in the “Selecting IV Source Using Input Token Control Word” table below.
Since the IV format options (bits [11:10] in Context Control Word 1) can modify the IV source
further, please refer to “Summary of Possible IV Selection Result Scenarios” following this table.
Note that in addition to the IV format field, the IV3 counter value can also be modified
according to the Crypto Mode field setting (bits [2:0] of Context Control Word 1), which automatically initializes this counter value to 32’h00000001 or 16’h0000. Refer to “Context Control Word 1
Definition” on page 439.
Table D-6. Selecting IV Source Using Input Token Control Word
Input Token Control   IV Field Source Modification (After Initial IV Fetch from Context Record)
Word Bits [28:26]
000                   No IV source modification is required. All IV fields reflect the context record
                      values. No IV data taken from the input token or the PRNG.
001                   Source all IV fields from the PRNG; if the PRNG is not ready, packet processing
                      must be held.
010                   Source IV3 field from the input token and keep the context record values for the
                      other IV fields.
011                   Source IV3 field from the input token and use PRNG for other IV fields.
100                   Source IV0 and IV3 fields from the input token and keep the context values for the
                      other IV fields.
101                   Source IV0 and IV3 fields from the input token and use PRNG for other IV fields.
110                   Source IV0 and IV1 fields from the input token and keep the context values for the
                      other IV fields.
111                   Source all four IV fields from the input token.
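The eight settings of the token IV bits can be captured as a per-field source map. The Python sketch below is an illustration under stated assumptions: the function name iv_sources and the source labels are hypothetical, and the sketch models only the token control word selection. It does not apply the later adjustments made by the IV format bits or the Crypto Mode field.

```python
# IV fields (by index 0..3) that each 3-bit code takes from the input token.
_FROM_TOKEN = {
    0b000: (),
    0b001: (),
    0b010: (3,),
    0b011: (3,),
    0b100: (0, 3),
    0b101: (0, 3),
    0b110: (0, 1),
    0b111: (0, 1, 2, 3),
}
# Codes for which the remaining fields are filled from the PRNG
# instead of keeping the context record values.
_PRNG_FILL = {0b001, 0b011, 0b101}


def iv_sources(token_iv_bits: int) -> list:
    """Return the source of IV0..IV3 for token control word bits [28:26]."""
    if token_iv_bits not in _FROM_TOKEN:
        raise ValueError("IV[2:0] is a 3-bit field")
    default = "prng" if token_iv_bits in _PRNG_FILL else "context"
    sources = [default] * 4
    for index in _FROM_TOKEN[token_iv_bits]:
        sources[index] = "token"
    return sources
```

For example, iv_sources(0b101) yields token, prng, prng, token for IV0 through IV3.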
Summary of Possible IV Selection Result Scenarios
The figures below show how the two IV format bits of the Context Control Word 1, bits [11:10]
are used with the three IV bits in the first dword of Input Token, bits [28:26] described above, to
further modify the IV source selection, yielding eight possible Result IV scenarios.
[Figure: the 128-bit IV register after context load (IV0 / Nonce, IV1, IV2, IV3) and the resulting register for each IV format setting: ‘00’ Full IV mode, ‘01’ Counter mode, ‘10’ Original sequence number mode (byte-swapped sequence number from context), ‘11’ Incremented sequence number mode (outbound only).]
Figure D-9: Result IV Using IV [2:0] = 3'b000 with IV format [11:10]
[Figure: the 128-bit IV register after context load (IV0 / Nonce, IV1, IV2, IV3) and the resulting register, with fields filled from PRNG output, for each IV format setting: ‘00’ Full IV mode, ‘01’ Counter mode, ‘10’ Original sequence number mode, ‘11’ Incremented sequence number mode (outbound only).]
Figure D-10: Result IV Using IV [2:0] = 3'b001 with IV Format [11:10]
[Figure: the 128-bit IV register after context load (IV0 / Nonce, IV1, IV2, IV3) and the resulting register, with the last field replaced by the token IV word, for each IV format setting: ‘00’ Full IV mode, ‘01’ Counter mode, ‘10’ Original sequence number mode, ‘11’ Incremented sequence number mode (outbound only).]
Figure D-11: Result IV Using IV [2:0] = 3'b010 with IV Format [11:10]
[Figure: the 128-bit IV register after context load (IV0 / Nonce, IV1, IV2, IV3) and the resulting register, combining PRNG output with the token IV word, for each IV format setting: ‘00’ Full IV mode, ‘01’ Counter mode, ‘10’ Original sequence number mode, ‘11’ Incremented sequence number mode (outbound only).]
Figure D-12: Result IV Using IV [2:0] = 3'b011 with IV Format [11:10]
[Figure: the 128-bit IV register after context load (IV0 / Nonce, IV1, IV2, IV3) and the resulting register, with the 1st and 2nd token IV words in the first and last fields, for each IV format setting: ‘00’ Full IV mode, ‘01’ Counter mode, ‘10’ Original sequence number mode, ‘11’ Incremented sequence number mode (outbound only).]
Figure D-13: Result IV Using IV [2:0] = 3'b100 with IV Format [11:10]
[Figure: the 128-bit IV register after context load (IV0 / Nonce, IV1, IV2, IV3) and the resulting register, combining token IV words with PRNG output, for each IV format setting: ‘00’ Full IV mode, ‘01’ Counter mode, ‘10’ Original sequence number mode, ‘11’ Reserved.]
Figure D-14: Result IV Using IV [2:0] = 3'b101 with IV Format [11:10]
[Figure: the 128-bit IV register after context load (IV0 / Nonce, IV1, IV2, IV3) and the resulting register, with token IV words in the leading fields and IV2, IV3 kept from context, for each IV format setting: ‘00’ Full IV mode, ‘01’ Counter mode, ‘10’ Original sequence number mode, ‘11’ Incremented sequence number mode (outbound only).]
Figure D-15: Result IV Using IV [2:0] = 3'b110 with IV Format [11:10]
[Figure: the 128-bit IV register after context load (IV0 / Nonce, IV1, IV2, IV3) and the resulting register, with fields taken from token IV words, for each IV format setting: ‘00’ Full IV mode, ‘01’ Counter mode, ‘10’ Original sequence number mode, ‘11’ Incremented sequence number mode (outbound only).]
Figure D-16: Result IV Using IV [2:0] = 3'b111 with IV Format [11:10]
U: Upper Layer Header from Token
If this bit is set, the checksum register in the Crypto Packet Processor context space is written
with the checksum value supplied from the token. The checksum value in context space can
then either be compared against the checksum in the packet header or be inserted in
the packet header. See D.5.4.3 “INSERT Instruction” on page 398 and D.5.4.7 “RETRIEVE Instruction” on page 403.
Input Packet Pointer (token dword [1], Required)
Must be 0.
Output Packet Pointer (token dword [2], Required)
Must be 0.
Context Control Words [1:0] (token dwords [5:4], Optional)
Optionally, the Context Control Words [1:0] can be placed in the token at dword location [5:4], as
specified by setting the C field in the first dword of a token header. Refer to “C: Context Control
Words Present in Token” on page 381.
IV [3:0] (Optional)
Optionally, IV [3:0] can be placed in the input token at dword locations [9:6].
Processing Instructions (One Instruction Minimum)
Refer to D.5 “Processing Instructions” below for a detailed discussion and definition of all
processing instructions supported by the Crypto Packet Processor.
Bypass Token Data (Optional)
This field contains data that must be bypassed from the input token to the result token and can
have any length between 0 and 4 dwords. The first bypass data word must contain the bypass
opcode.
D.5 Processing Instructions
D.5.1
Instruction Types
There are five types of processing instructions. The first two types (Operational Data and IP
Header Instructions) are executed by the Crypto Packet Processor preprocessor and can occur in
any order. These preprocessor instructions create the different data streams for crypto, hash, and
output.
Type 3 instructions are executed on the result data stream by the Crypto Packet Processor post-process
module. Type 3 instructions can be mixed with Type 1 and 2 instructions, and can modify or append result data fields in the output packet.
Note: Types 1, 2, and 3 instructions are also referred to as execution instructions, since these are
the only instructions “executed” by the pre- and post-processors.
Type 4 instructions are executed by the Crypto Packet Processor control module and can
only occur after Types 1, 2, and 3 instructions.
Type 5 instructions are also executed by the Crypto Packet Processor control module and
can occur before or after the Types 1, 2, and 3 instructions.
Type 4 and 5 instructions are executed in the control module and do not affect the result
data. Only context records and result tokens contain the results of these instructions.
The following subsections group the processing instructions under their appropriate instruction
type. Also shown are the key fields used by each instruction.
D.5.1.1 Operational Data Instructions (Type 1)
The following operational data instructions are executed by preprocessor:
1. DIRECTION
2.
PRE_CHECKSUM
3. INSERT
4. INSERT_CTX
5. REPLACE
6. RETRIEVE
7. MUTE
D.5.1.2 IP Header Instructions (Type 2)
The following IP header instructions are executed by the preprocessor:
1. IPV4_CHECKSUM
2. IPV4
3. IPV6
D.5.1.3 Post-Process Instructions (Type 3)
After packet processing, the Crypto Packet Processor can perform the following instructions:
1. INSERT_REMOVE_RESULT
2. REPLACE_BYTE
D.5.1.4 Result Instructions (Type 4)
There is currently only one instruction of this type:
VERIFY_FIELDS
D.5.1.5 Context Control Instructions (Type 5)
There is currently only one instruction of this type:
CONTEXT_ACCESS
D.5.1.6 Special Instructions (Type 6)
There is currently only one instruction of this type:
BYPASS
D.5.2 Instruction Sequencing
Token processing instructions are restricted to the sequencing shown in Table D-1.
Table D-1. Instruction Sequencing
Sequencing of Instructions (in order of occurrence)
1. Context Control Instructions with the ‘result type’ field equal to ‘00’ (see Note 2)
2. Execution Instructions (Instruction Types 1, 2, and 3) (see Note 1)
3. Context Control Instructions with the ‘result type’ field equal to ‘00’ (see Note 2)
4. Result Instructions (Instruction Type 4)
5. Context Control Instructions (Instruction Type 5)
Note 1: Execution instructions are all instructions executed by the pre/post-processor.
Note 2: ‘Result type’ = ‘00’ refers to Context Control Instructions that always execute, and therefore have their F and P instruction
fields set to ‘0’. (Refer to “Context Control Instructions” on page 425.)
D.5.2.1 Sequencing Rules
The following rules apply when using instructions:
• Context Control Instructions can never be mixed with the execution instructions (Types 1, 2,
and 3).
• The Result Instructions must be located after the execution instructions.
• A token can contain only one Result Instruction.
• Context Control Instructions that occur after the execution instructions and before the Result
Instruction must have the ‘result type’ field equal to ‘00’ (execute always). Refer to “Context Control
Instructions” on page 425.
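The ordering in Table D-1 and the sequencing rules can be checked in software before a token is submitted. The following is a minimal sketch under stated assumptions, not part of any Tilera API: the `instr_class_t` enum and `token_sequence_ok()` are hypothetical names, the caller is assumed to have already classified each instruction (including whether a Context Control Instruction has result type ‘00’), and the BYPASS special instruction is not modeled. It reads Table D-1 strictly, so a Context Control Instruction with any other result type is accepted only after the Result Instruction.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of each instruction in a token. */
typedef enum {
    INSTR_EXEC,        /* Types 1, 2, and 3 (pre/post-processor) */
    INSTR_CTX_ALWAYS,  /* Type 5 with result type '00' */
    INSTR_CTX_COND,    /* Type 5 with any other result type */
    INSTR_RESULT       /* Type 4 */
} instr_class_t;

/* Returns true if the sequence follows the order of Table D-1. */
static bool token_sequence_ok(const instr_class_t *seq, size_t n)
{
    /* Phases: 0 = leading context control, 1 = execution,
     * 2 = context control after execution, 3 = after the result. */
    int phase = 0;

    for (size_t i = 0; i < n; i++) {
        switch (seq[i]) {
        case INSTR_EXEC:
            if (phase > 1)
                return false;   /* execution mixed with later phases */
            phase = 1;
            break;
        case INSTR_CTX_ALWAYS:
            if (phase == 1)
                phase = 2;      /* closes the execution block */
            break;
        case INSTR_CTX_COND:
            if (phase < 3)
                return false;   /* only allowed after the result */
            break;
        case INSTR_RESULT:
            if (phase == 3)
                return false;   /* only one Result Instruction */
            phase = 3;
            break;
        }
    }
    return true;
}
```

A token builder could call `token_sequence_ok()` on its classified instruction list and refuse to submit the token when it returns false.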
D.5.3 Instruction Format
The general format for all token processing instructions is described in this section.
31:28 opcode | 27:19 instruction dependent fields | 18:17 STAT | 16:00 length / offset
31:00 data (optional)
Table D-2. Instruction Format
Bits
Name
Description
31:28
opcode
Each instruction has a unique operation code (opcode). Table D-3 provides a
complete list of instruction operation codes. The opcodes for reserved
instructions cannot be used.
27:19
Instruction dependent fields
The definition of fields within bits [27:19] depends on the instruction
used. For details on how these bits are used, refer to the specific
instruction in D.5 Processing Instructions.
18:17
STAT
STAT Definition for Type 1 and 2 Instructions.
The STAT field is used to make sure cryptographic and authentication operations are completed. As long as the status bits are 0, all data streams expect
to receive more data after execution of this instruction. If the hash status bit
(bit 17) is set, the current instruction passes the last hash data. If the last data
bit (bit 18) is set, the current instruction has operated on the last data bit from
the datapath through the packet-processing engine.
Encoding of the STAT field for Type 1 and 2 instructions:
00
processing
01
last hash data for hash engine
10
last data for packet processing
11
last hash data for hash engine and last data for packet processing
STAT Definition for Type 3 Instructions.
For the insert result (opcode: 1011) and replace (opcode: 1100) instructions, these bits are used differently:
If the checksum status bit (bit 17) is set, the modification does not modify the
checksum calculated by the postprocessing.
If the last insert bit (bit 18) is set, this instruction will be the last instruction
that inserts data into the data stream. After execution of this instruction, it is
not required for this packet to hold data in the packet buffer.
Encoding of the STAT field for postprocessing instructions:
00
checksum modification and not last INSERT instruction
01
no checksum modification required
10
last insert header instruction; remaining post-process instructions
are append data
11
no checksum modification required and last ‘insert header’
instruction
The use of the last insert bit (for postprocessing instructions) is optional, but setting it
can improve performance.
16:00
length/offset
Length Pointer.
length
If a length field applies, indicates the number of bytes that are to be processed by the instruction. For Type 1 instructions, the length field must be
greater than 0.
offset
If an offset field applies, it is the offset into the output stream. All data bytes
are written into the data stream starting at this offset position.
31:00
data (Optional)
The data appended to an instruction must be specified in little-endian format.
All data words to be inserted are concatenated after the instruction. The
length field indicates the number of bytes to be inserted into
the data stream, which equals the amount of data appended after the instruction in the
token.
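The general format in Table D-2 maps directly onto shifts and masks. The sketch below packs the four fields into a 32-bit instruction word; `pack_instruction()` and the `STAT_*` names are hypothetical and chosen only for illustration, with the STAT values taken from the Type 1 and 2 encoding listed in the table.

```c
#include <stdint.h>

/* STAT encodings for Type 1 and 2 instructions (bits 18:17). */
enum {
    STAT_PROCESSING     = 0x0,  /* 00: more data follows */
    STAT_LAST_HASH      = 0x1,  /* 01: last hash data for hash engine */
    STAT_LAST_DATA      = 0x2,  /* 10: last data for packet processing */
    STAT_LAST_HASH_DATA = 0x3   /* 11: last hash data and last data */
};

/* Pack the fields of the general instruction format into one word. */
static uint32_t pack_instruction(uint32_t opcode, uint32_t dep_fields,
                                 uint32_t stat, uint32_t len_or_off)
{
    return ((opcode     & 0xFu)   << 28) |  /* bits 31:28 */
           ((dep_fields & 0x1FFu) << 19) |  /* bits 27:19 */
           ((stat       & 0x3u)   << 17) |  /* bits 18:17 */
            (len_or_off & 0x1FFFFu);        /* bits 16:00 */
}
```

For example, `pack_instruction(0x0, 0x0, STAT_LAST_DATA, 64)` yields `0x00040040`. Masking each field before shifting keeps an out-of-range argument from corrupting neighboring fields.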
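Table D-3. Instruction Operation Codes
Operation Code | Instruction | Instruction Type
0000 | DIRECTION | Operational Data Instruction (Type 1)
0001 | PRE_CHECKSUM | Operational Data Instruction (Type 1)
0010 | INSERT | Operational Data Instruction (Type 1)
1001 | INSERT_CTX | Operational Data Instruction (Type 1)
0011 | REPLACE | Operational Data Instruction (Type 1)
0100 | RETRIEVE | Operational Data Instruction (Type 1)
0101 | MUTE | Operational Data Instruction (Type 1)
0111 | IPV4 (with checksum untouched) | IP Header Instruction (Type 2)
0110 | IPV4_CHECKSUM | IP Header Instruction (Type 2)
1000 | IPV6 | IP Header Instruction (Type 2)
1010 | INSERT_REMOVE_RESULT | Post-processing Instruction (Type 3)
1011 | REPLACE_BYTE | Post-processing Instruction (Type 3)
1100 | reserved | n/a
1101 | VERIFY | Result Instruction (Type 4)
1110 | CONTEXT_ACCESS | Context Control Instruction (Type 5)
1111 | BYPASS_TOKEN_DATA | Special Instruction (Type 6)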
D.5.4 Operational Data Instructions
D.5.4.1 Direction Instruction
This instruction does not modify any data; it only passes a number of bytes to the crypto engine,
hash engine, output buffer, or combinations of these. The length field (in bytes) indicates the
amount of packet data to be transferred by this instruction. The t.o.dest. field determines the
destination(s). The L bit can be used to indicate the last block for the crypto or hash engines; this
bit can optionally be used to improve performance in the case of a block cipher and must be used
for stream ciphers and GCM (GHASH).
Note: Token instructions that send data to the crypto engine must follow each other directly;
that is, the input to the crypto engine must be provided as a single, uninterrupted block of
data. No instruction that sends data to a destination other than the crypto engine can occur
in between. The crypto engine cannot handle the data alignment issues that may occur if
intermediate data handling instructions occur between two instructions targeting the
crypto engine.
DIRECTION
31:28 opcode (0000) | 27 L | 26:24 t.o. dest. | 23:19 reserved | 18:17 STAT | 16:00 length
Table D-4. DIRECTION Definition
Bits
Name
Description
31:28
opcode
Each instruction has a unique operation code (opcode). Table D-3 provides a
complete list of instruction operation codes. The opcodes for reserved
instructions cannot be used. Refer to "opcode" for a description of this field.
27
L (Last)
When the ‘L’ bit is set, and the crypto data bit (bit 26) in the ‘t.o.dest.’ field is
also set, then this is the last crypto data block. In the case of CTR or ICM
mode, the ‘L’ bit must be set for the last block; for all other modes this bit is
optional.
When the ‘L’ bit is set, and the hash bit (bit 25) in the ‘t.o.dest.’ field is also
set, while the crypto bit (bit 26) is NOT set, then this is the last hash data
before crypto data. The ‘L’ bit is only required for a GHASH (GCM) operation
where the AAD data needs to be hashed in a separate operation before the
crypto data is hashed.
26:24
t.o.dest.
Type of destination. See Table D-5 below.
23:19
reserved
reserved.
18:17
STAT
Refer to STAT field in Table D-2 for a description of this field.
16:00
length
Refer to "length" on page 395 for a description of this field.
Table D-5. Type of Destination Field Description
Value | Type of Destination | Use Example
000 | reserved | n/a
001 | output only (for bypass data) | IP header
010 | hash engine only | extended sequence number
011 | hash engine and output | protocol header
100 | crypto engine only | n/a
101 | crypto engine and output | encrypt only payload
110 | crypto engine and hash engine | n/a
111 | crypto engine, hash engine and output | ESP payload
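As an illustration of how the DIRECTION fields and the type-of-destination values combine, the sketch below builds a DIRECTION instruction word. The function and macro names are hypothetical; the bit positions follow Table D-4, and the three destination bits follow from the values in Table D-5 (bit 26 crypto, bit 25 hash, bit 24 output).

```c
#include <stdint.h>

#define OPCODE_DIRECTION 0x0u  /* 0000, from Table D-3 */

/* Type-of-destination bits; OR them together for combinations. */
#define DEST_OUTPUT 0x1u  /* 001: output */
#define DEST_HASH   0x2u  /* 010: hash engine */
#define DEST_CRYPTO 0x4u  /* 100: crypto engine */

/* Build a DIRECTION word; bits 23:19 are reserved and left as zero. */
static uint32_t make_direction(int last, uint32_t dest,
                               uint32_t stat, uint32_t length)
{
    return (OPCODE_DIRECTION << 28) |
           ((last ? 1u : 0u) << 27) |   /* L bit */
           ((dest & 0x7u) << 24) |      /* t.o. dest. */
           ((stat & 0x3u) << 17) |      /* STAT */
           (length & 0x1FFFFu);         /* length in bytes */
}
```

Passing 20 bytes of IP header to the output only would be `make_direction(0, DEST_OUTPUT, 0, 20)`, which gives `0x01000014`. An ESP payload sent to all three destinations uses `DEST_CRYPTO | DEST_HASH | DEST_OUTPUT`.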
D.5.4.2 PRE_CHECKSUM Instruction
The PRE_CHECKSUM instruction is used to update 16-bit checksums during preprocessing. This
instruction is typically used in performing NAT operations where checksum updating is required
to reflect updated IP addresses or port numbers.
PRE_CHECKSUM
31:28 opcode (0001) | 27 L | 26:24 t.o. cmd. (111) | 23:19 origin (01010) | 18:17 STAT | 16:00 checksum update value
Table D-1. PRE_CHECKSUM Instruction Definition
Bits
Name
Description
31:28
opcode
Refer to "opcode" for a description of this field.
27
L (Last)
The description for the ‘L’ field is the same as in “Direction Instruction” on
page 396.
Note that when PRE_CHECKSUM is used with IP protocol, L is always set to
‘0’, since the checksum is never last.
26:24
t.o.cmd. (type of command)
This field must be set to ‘111’.
23:19
origin
The origin field must be set to ‘01010’.
18:17
STAT
Refer to Table D-2 for a description of this field.
16:00
checksum update value
This value must be the difference between the old checksum value and the
new desired checksum value. The checksum update value is generated as
follows:
XOR original values with 0xFFFF (invert). Use this result to perform a 16-bit
‘ADD with carry’ with the ‘checksum update value’ field of the PRE_CHECKSUM instruction. The result of this calculation is then XOR’d with 0xFFFF and
replaces the original value in the data stream.
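The update procedure described for the checksum update value can be modeled in software as follows. This is a sketch under assumptions: ‘ADD with carry’ is taken to mean 16-bit one's-complement addition (the carry is folded back into the sum), and `checksum_update_value()` is a hypothetical helper derived algebraically from that procedure rather than taken from this guide.

```c
#include <stdint.h>

/* 16-bit one's-complement addition ("ADD with carry"). */
static uint16_t add_with_carry16(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + b;
    sum = (sum & 0xFFFFu) + (sum >> 16);  /* fold the carry back in */
    return (uint16_t)sum;
}

/* Apply the update as described: invert, add with carry, invert. */
static uint16_t apply_checksum_update(uint16_t original, uint16_t update)
{
    uint16_t inverted = (uint16_t)(original ^ 0xFFFFu);
    uint16_t summed   = add_with_carry16(inverted, update);
    return (uint16_t)(summed ^ 0xFFFFu);
}

/* Hypothetical helper: an update value that maps 'original' to
 * 'desired' under the procedure above. One's-complement arithmetic
 * has two zeros, so desired == 0xFFFF comes back as 0x0000. */
static uint16_t checksum_update_value(uint16_t original, uint16_t desired)
{
    return add_with_carry16((uint16_t)(desired ^ 0xFFFFu), original);
}
```

With an original value of 0x1234 and a desired value of 0xABCD, the helper returns 0x6666, and applying 0x6666 to 0x1234 gives 0xABCD again.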
D.5.4.3 INSERT Instruction
The INSERT instruction is used to insert data into the data stream before the data enters the
packet processing module. The INSERT instruction inserts data from the internal register bank,
such as the context registers or data directly from the token following the instruction (also
referred to as an INSERT immediate). The data source to be inserted is selected by the origin
field.
Note that in Table D-2, some of the internal registers are ordered to allow multiple fields to be
inserted with one instruction. For example, the registers highlighted in yellow (SPI, sequence
number result, and IV0 through IV3) are typically inserted with one instruction.
In addition to inserting data into the data stream, the INSERT instruction can also insert one of
several types of padding: zero padding, PKCS#7, Constant, RTP, IPSec, TLS, SSL, or TFC. Padding is applied if the two most significant bits of the origin field are set to ‘00’.
INSERT
31:28 opcode (0010) | 27 L | 26:24 t.o. dest. | 23:19 padding/origin | 18:17 STAT | 16:09 extended length / insert value | 08:00 length
Table D-2. INSERT Instruction Definition
Bits
Name
Description
31:28
opcode
Refer to "opcode" for a description of this field.
27
L (Last)
The description for the ‘L’ field is the same as in “Direction Instruction” on
page 396.
26:24
t.o.dest. - type of destination
The values for the ‘t.o.dest.’ field are the same as those listed in “Direction
Instruction” on page 396.
23:19
padding/origin
The INSERT instruction uses the padding/origin field for one of two possible purposes, padding or origin.
1. Padding Use: MSbs = ‘00’
If the two MSbs are ‘00’, then the three LSBs indicate the type of padding to
be used.
For Constant, RTP, SSL, IPSec padding: bits [16:9] contain the value that
needs to be inserted.
In case of Constant, RTP and SSL padding, the insert value represents the
constant padding value. The location of the inserted value is indicated by the
value ‘99’ in the examples below.
For TFC, the full 17 bits are used to indicate the length.
Possible padding values for this field are listed in the following table:
Padding Type | Padding Sequence
00000 zero padding | 00-00-00-00-00-00 (Note: This padding type can also be used to insert a number of ‘0’ bytes (1 to 511) into the data stream.)
00001 PKCS#7 | 06-06-06-06-06-06
00010 Constant | 99-99-99-99-99-99
00011 RTP | 99-99-99-99-99-06
00100 IPSec (1) | 01-02-03-04-04-99 (in this case, ‘99’ should contain the NH value from the original datagram)
00101 TLS | 05-05-05-05-05-05
00110 SSL | 99-99-99-99-99-05
00111 TFC | 00-00-…-00-00-…00-00 (The ‘extended length’ field applies to this padding selection.)
18:17
STAT
Refer to the definition for the STAT field in Table D-2.
16:09
extended length / insert value
Depending on the value of the padding/origin field, this field is interpreted as
either the ‘extended length’ (the upper 8 bits of the length value) or the ‘insert
value’ (the value to be used for certain padding modes, for example Constant,
RTP, and SSL).
8:00
length
Refer to "length" on page 395 for a description of this field. Note that when padding
is inserted, this field indicates the total number of padding bytes.
Example: To add 14 IPSec padding bytes, length field and next header, specify the length value of 16 (0x10 in hexadecimal). To add 20 bytes of SSL or
TLS padding and length field, specify the length value of 21 (0x15 in hexadecimal).
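The six-byte sequences in the padding table suggest how each padding type is built from the total length and the insert value. The following software model reproduces those sequences. It is a sketch inferred only from the table's examples: `pad_type_t` and `fill_padding()` are hypothetical names, and the behavior for lengths other than the examples shown is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Three LSBs of padding/origin when its two MSbs are '00'. */
typedef enum {
    PAD_ZERO = 0, PAD_PKCS7 = 1, PAD_CONSTANT = 2, PAD_RTP = 3,
    PAD_IPSEC = 4, PAD_TLS = 5, PAD_SSL = 6, PAD_TFC = 7
} pad_type_t;

/* Fill out[0..len-1] with the padding sequence for 'type'.
 * 'value' is the insert value (shown as '99' in the table). */
static void fill_padding(pad_type_t type, uint8_t value,
                         uint8_t *out, size_t len)
{
    if (len == 0 || (type == PAD_IPSEC && len < 2))
        return;

    switch (type) {
    case PAD_ZERO:
    case PAD_TFC:
        memset(out, 0x00, len);
        break;
    case PAD_PKCS7:                 /* every byte = total length */
        memset(out, (int)len, len);
        break;
    case PAD_CONSTANT:
        memset(out, value, len);
        break;
    case PAD_RTP:                   /* last byte = total length */
        memset(out, value, len - 1);
        out[len - 1] = (uint8_t)len;
        break;
    case PAD_IPSEC:                 /* 01, 02, ..., pad length, NH */
        for (size_t i = 0; i + 2 < len; i++)
            out[i] = (uint8_t)(i + 1);
        out[len - 2] = (uint8_t)(len - 2);
        out[len - 1] = value;
        break;
    case PAD_TLS:                   /* every byte = length - 1 */
        memset(out, (int)(len - 1), len);
        break;
    case PAD_SSL:                   /* last byte = length - 1 */
        memset(out, value, len - 1);
        out[len - 1] = (uint8_t)(len - 1);
        break;
    }
}
```

Calling `fill_padding(PAD_IPSEC, nh, buf, 16)` models the case of 14 IPSec padding bytes plus the length field and next header, matching the length value of 16 in the example above.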
(1) To create a packet that uses IPSec padding filled with zeros (00-00-00-00-04-NH), two token instructions should be used:
a) INSERT the SSL pad (length is 4+1=5 bytes), to insert the sequence (00-00-00-00-04).
b) INSERT 1 byte from the token, where the byte is the NH value.
2. Origin Use: MSbs = not ‘00’.
Possible origin values for this field are listed in Table D-3.
Table D-3. ‘Origin’ Field Encoding for Preprocessing Instructions
‘origin’ Field
Value
INSERT Instruction — Internal Register(s) to be Inserted
R/W Status
01000
seq_num result – copy of 10011
For outbound, this is the incremented result of the sequence number from context.
R/W
01001
extended sequence number result
For inbound, this is an estimation based on sequence number from context and
retrieved sequence number data.
For outbound, this is the incremented result of the 64-bit sequence number from context.
Note: This origin value should be used for inserting the IPSec extended sequence
number into the data stream for use by the hash engine.
R/W
01010
seq_num (from context)
R
01011
extended sequence number (from context)
R
01100
seq_num (from context) – copy of 01010
R
01101 - 01111
reserved
-
10000
General purpose register 0
R/W
10001
General purpose register 1
R/W
10010
SPI
If length is 8, SPI and seq_num result is inserted.
If length is 16 or 24, SPI, seq_num result, and IV are inserted.
W – SPI result
R – SPI active
10011
seq_num result
R/W
10100
IV0 – first IV word
R/W
10101
IV1 – second IV word
R/W
10110
IV2 – third IV word
R/W
10111
IV3 – fourth IV word
R/W
11000
SPI result register
R
11001
checksum – 16-bit checksum value optionally located in the 16 LSBs of input token.
R/W
11010
checksum calculation store
R/W
11011
Indicates that this is an ‘INSERT (immediate)’ instruction. See “INSERT Instruction”
on page 398 and the INSERT (immediate) definition in Table D-4.
-
11100
hash result digest from hash engine.
Note: Can be any length up to 16 dwords (512 bits).
R/W
11101 - 11111
reserved
-
If the length exceeds the size of the field selected by one origin value, the instruction continues with
the field located at the next origin value. Be aware that after the fourth hash word, the fifth through the
16th hash words are read.
INSERT (immediate)
The INSERT (immediate) instruction inserts the data immediately following the instruction into
the data stream. It is identified by the value ‘0b11011’ in the origin field.
31:28 opcode (0010) | 27 L | 26:24 t.o. dest. | 23:19 origin (11011) | 18:17 STAT | 16:00 length (bits 16:09 shown as 0)
31:00 data (repeated for each appended data word)
Table D-4. INSERT (immediate) Definition
Bits
Name
Description
31:28
opcode
Refer to "opcode" for a description of this field.
27
L (Last)
The description for the ‘L’ field is the same as in “Direction Instruction” on
page 396.
26:24
t.o.dest. - type of destination
The values for the ‘t.o.dest.’ field are the same as those listed in “Direction
Instruction” on page 396.
23:19
origin
The value of ‘0b11011’ indicates that this is an INSERT (immediate) instruction.
18:17
STAT
Refer to Table D-2, “Instruction Format,” on page 394 for a description of this
field.
16:00
length
Refer to "length" on 395 for a description of this field.
31:00
data
Refer to “Processing Instructions” on page 392, for a description of this field.
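An INSERT (immediate) occupies one instruction word plus the appended data words. The sketch below writes both into a token buffer. `emit_insert_immediate()` is a hypothetical helper; it assumes that "little-endian" means the first data byte occupies the least significant byte of each word, that a partial final word is zero-filled, and that the length is limited to 511 bytes because the instruction diagram shows bits 16:09 as zero.

```c
#include <stddef.h>
#include <stdint.h>

#define OPCODE_INSERT    0x2u   /* 0010, from Table D-3 */
#define ORIGIN_IMMEDIATE 0x1Bu  /* 11011 selects INSERT (immediate) */

/* Writes the instruction word followed by the data words into token[].
 * Returns the number of 32-bit words written, or 0 if the length is
 * out of range or the buffer is too small. */
static size_t emit_insert_immediate(uint32_t *token, size_t max_words,
                                    int last, uint32_t dest, uint32_t stat,
                                    const uint8_t *data, uint32_t len_bytes)
{
    size_t data_words = ((size_t)len_bytes + 3u) / 4u;

    /* Type 1 length must be greater than 0. */
    if (len_bytes == 0 || len_bytes > 0x1FFu || max_words < 1 + data_words)
        return 0;

    token[0] = (OPCODE_INSERT << 28) | ((last ? 1u : 0u) << 27) |
               ((dest & 0x7u) << 24) | (ORIGIN_IMMEDIATE << 19) |
               ((stat & 0x3u) << 17) | len_bytes;

    for (size_t w = 0; w < data_words; w++) {
        uint32_t word = 0;
        /* Assumed byte order: first byte in the least significant byte. */
        for (size_t b = 0; b < 4 && 4 * w + b < len_bytes; b++)
            word |= (uint32_t)data[4 * w + b] << (8 * b);
        token[1 + w] = word;
    }
    return 1 + data_words;
}
```

Inserting the five bytes AA BB CC DD EE to the output (t.o. dest. = 001) produces three words under these assumptions: `0x21D80005`, `0xDDCCBBAA`, and `0x000000EE`.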
D.5.4.4 INSERT Instruction Example – NOP
The following INSERT instruction can be used as a NOP or dummy instruction.
NOP
31:28 opcode (0010) | 27 L (0) | 26:24 t.o. dest. (000) | 23:19 padding/origin (00000) | 18:17 STAT | 16:00 reserved (bit 02 shown as 1, all other bits shown as 0)
This instruction does not process any data and takes exactly one clock cycle to process.
Note that the sequencing rules must still be respected and that, since t.o.dest. is 0 (the
crypto bit is not set), the cryptographic data stream must not be interrupted with this NOP
instruction.
Table D-5. NOP Definition
Bits
Name
Description
31:28
opcode
Refer to "opcode" for a description of this field.
27
L (Last)
The description for the ‘L’ field is the same as in “Direction Instruction” on
page 396.
26:24
t.o.dest. - type of destination
The values for the ‘t.o.dest.’ field are the same as those listed in “Direction
Instruction” on page 396.
23:19
origin
The value of ‘0b11011’ indicates that this is an INSERT (immediate) instruction.
18:17
STAT
Refer to Table D-2, “Instruction Format,” on page 394 for a description of this
field.
16:00
reserved
Reserved.
D.5.4.5 INSERT_CTX Instruction
This instruction is intended to insert:
1. Data from the token into the output stream while, at the same time, writing this data to the context
record.
2. Data from the context record into the output stream while, at the same time, writing this data to
another location in the context record.
31:28 opcode (1001) | 27 L | 26:24 t.o. dest. | 23:19 origin | 18:17 STAT | 16:12 context dest. | 11:09 reserved (0) | 08:00 length
Table D-1. INSERT_CTX Definition
Bits
Name
Description
31:28
opcode
Refer to "opcode" for a description of this field.
27
L (Last)
The description for the ‘L’ field is the same as in “Direction Instruction” on
page 396.
26:24
t.o.dest
Destination in the datapath.
23:19
origin
Indicates origin of data.
18:17
STAT
Refer to "STAT" for a description of this field.
16:12
context_dest
Indicates destination in the context record. When this field is 0, no write to the
context record is performed.
11:9
reserved
reserved.
8:00
length
Refer to "length" on page 395 for a description of this field.
Note: When using the INSERT_CTX instruction to read fields from the context that were updated
by a previous instruction, another instruction must be used before the INSERT_CTX
instruction. This can be any instruction or simply the NOP instruction. This ensures that the
Crypto Packet Processor has time to commit the updated context values before they are
read.
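INSERT_CTX splits the low 17 bits differently from the other operational data instructions, which is easy to get wrong when packing by hand. A minimal sketch of the layout in the INSERT_CTX definition table follows; `make_insert_ctx()` is a hypothetical name, and the valid encodings for the origin and context_dest fields are not modeled here.

```c
#include <stdint.h>

#define OPCODE_INSERT_CTX 0x9u  /* 1001, from Table D-3 */

/* Build an INSERT_CTX word. A context_dest of 0 means no write to
 * the context record is performed. */
static uint32_t make_insert_ctx(int last, uint32_t dest, uint32_t origin,
                                uint32_t stat, uint32_t context_dest,
                                uint32_t length)
{
    return (OPCODE_INSERT_CTX << 28) |
           ((last ? 1u : 0u) << 27) |
           ((dest & 0x7u) << 24) |           /* bits 26:24 */
           ((origin & 0x1Fu) << 19) |        /* bits 23:19 */
           ((stat & 0x3u) << 17) |           /* bits 18:17 */
           ((context_dest & 0x1Fu) << 12) |  /* bits 16:12 */
           /* bits 11:09 reserved, left as zero */
           (length & 0x1FFu);                /* bits 08:00 */
}
```

Note that the length field here is only 9 bits wide, so a single INSERT_CTX can describe at most 511 bytes.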
D.5.4.6 REPLACE Instruction
The REPLACE instruction operates similarly to the INSERT instruction, except that it overwrites
the input data in the data stream instead of inserting it. The REPLACE instruction is used to overwrite data in the data stream before the data enters the packet processing module.
REPLACE
31:28 opcode (0011) | 27 L | 26:24 t.o. dest. | 23:19 origin | 18:17 STAT | 16:00 length (bits 16:09 shown as 0)
Table D-2. REPLACE Definition
Bits
Name
Description
31:28
opcode
Refer to "opcode" for a description of this field.
27
L (Last)
The description for the ‘L’ field is the same as in “Direction Instruction” on
page 396.
26:24
t.o.dest. - type of destination
The values for the ‘t.o.dest.’ field are the same as those listed in “Direction
Instruction” on page 396.
23:19
origin
Refer to the origin field description in Table D-3. Origin values 00000-00111,
01100, 01111 and 11101-11111 are not applicable for the REPLACE instruction. Two additional origins are available for this instruction only.
01101
Increment current value: one byte from the input data stream is
incremented by 1. The "length" field (see page 395) must be set to 1
(17’h00001) when using this origin.
01110
Decrement current value: one byte from the input data stream is
decremented by 1. The "length" field (see page 395) must be set to 1
(17’h00001) when using this origin.
18:17
STAT
Refer to Table D-2, “Instruction Format,” on page 394 for a description of this
field.
16:00
length
Refer to "length" on page 395 for a description of this field.
D.5.4.7 RETRIEVE Instruction
The purpose of the RETRIEVE instruction is to retrieve length bytes of data from the input data
stream, starting from the point where the previous instruction stopped processing, and send this
data to a different location or simply remove it from the data stream.
Note that this instruction differs from the post-process instructions, such as INSERT_REMOVE_RESULT, which modify the output data stream (see “Post-Process Instructions” on page 408).
RETRIEVE
31:28 opcode (0100) | 27 L | 26:24 t.o. dest. | 23:19 origin | 18:17 STAT | 16:00 length (bits 16:09 shown as 0)
The following subsections show three examples of the RETRIEVE instruction.
Table D-3. RETRIEVE Definition
Bits
Name
Description
31:28
opcode
Refer to "opcode" for a description of this field.
27
L (Last)
The description for the ‘L’ field is the same as in “Direction Instruction” on page 396.
26:24
t.o.dest.
type of destination: The values for the ‘t.o.dest.’ field are the same as those listed in
“Direction Instruction” on page 396.
23:19
origin
Refer to the origin field description in Table D-3. Origin values 00000-00111, 01100,
01111 and 11101-11111 are not applicable for the RETRIEVE instruction.
18:17
STAT
Refer to Table D-2, “Instruction Format,” on page 394 for a description of this field.
16:00
length
Refer to "length" on page 395 for a description of this field.
RETRIEVE (Copy, Store, and Pass) Instruction Example
This RETRIEVE instruction copies data from the input data stream, stores the data in context registers (the destination register is selected by the origin field), and then passes the same data to the
processing engine (under control of the ‘type of destination’ (t.o.dest.) field).
31:28 opcode (0100) | 27 L | 26:24 t.o. dest. | 23:19 origin | 18:17 STAT | 16:00 length (bits 16:09 shown as 0)
RETRIEVE (Remove and Store) Instruction Example
This RETRIEVE instruction removes data from the input data stream and only stores data in the
context registers. Note that the L and t.o.dest. fields are set to all zeroes. The origin field is
used to indicate which context registers need to be overwritten.
31:28 opcode (0100) | 27 L (0) | 26:24 t.o. dest. (000) | 23:19 origin | 18:17 STAT | 16:00 length (bits 16:09 shown as 0)
RETRIEVE (Remove Only) Instruction Example
This RETRIEVE instruction simply removes data from the input data stream and does not store
the data or pass the data to the processing engine. The L and t.o. dest. fields are set to zeros
and the origin field is set to 11011.
31:28 opcode (0100) | 27 L (0) | 26:24 t.o. dest. (000) | 23:19 origin (11011) | 18:17 STAT | 16:00 length
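The three RETRIEVE examples differ only in the L, t.o. dest., and origin fields. The sketch below expresses them as thin wrappers around one packing function. All function names are hypothetical, and the origin value passed in is assumed to be one of the register encodings from the ‘origin’ field table.

```c
#include <stdint.h>

#define OPCODE_RETRIEVE 0x4u  /* 0100, from Table D-3 */

/* Build a RETRIEVE word from its fields. */
static uint32_t make_retrieve(int last, uint32_t dest, uint32_t origin,
                              uint32_t stat, uint32_t length)
{
    return (OPCODE_RETRIEVE << 28) | ((last ? 1u : 0u) << 27) |
           ((dest & 0x7u) << 24) | ((origin & 0x1Fu) << 19) |
           ((stat & 0x3u) << 17) | (length & 0x1FFFFu);
}

/* Remove and store: L = 0, t.o. dest. = 000, origin selects the
 * context register to overwrite. */
static uint32_t make_retrieve_store(uint32_t origin, uint32_t stat,
                                    uint32_t length)
{
    return make_retrieve(0, 0x0u, origin, stat, length);
}

/* Remove only: L = 0, t.o. dest. = 000, origin = 11011. */
static uint32_t make_retrieve_remove(uint32_t stat, uint32_t length)
{
    return make_retrieve(0, 0x0u, 0x1Bu, stat, length);
}
```

For the copy, store, and pass case, call `make_retrieve()` directly with a non-zero destination. For example, storing 16 bytes starting at IV0 (origin 10100) while also passing them to the hash engine (t.o. dest. = 010) gives `0x42A00010`.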
D.5.4.8 MUTE Instruction
The MUTE instruction performs a bitwise AND of the input data from the packet with the mask data
immediately following the MUTE instruction.
The MUTE instruction typically sends data to two destinations at the same time. The first destination, specified by the t.o. dest. field, receives the muted version of the data, and the second
destination, specified by the t.o. dest. 2 field, receives the original, unmuted data. Typically,
t.o. dest. is set to ‘010’ (hash engine) and t.o. dest. 2 is
set to ‘001’ (the output data stream).
The length field indicates the number of bytes to be muted, up to 508 bytes. The specified length
must be a multiple of four bytes, (bits [1:0] = ‘00’).
MUTE
31:28 opcode (0101) | 27 r (0) | 26:24 t.o. dest. | 23 r (0) | 22:20 t.o. dest. 2 | 19 M | 18:17 STAT | 16:00 length (bits 16:09 and 01:00 shown as 0)
31:00 mask0 … mask(n-1)
Table D-4. MUTE Definition
Bits
Name
Description
31:28
opcode
Refer to "opcode" for a description of this field.
27
r
Reserved, set to ‘0’.
26:24
t.o.dest. - type of
destination
The values for the ‘t.o.dest.’ field are the same as those listed in “Direction
Instruction” on page 396, except that the crypto engine is not allowed as a destination
(bit [26] must be set to ‘0’). Typically, this field is set to ‘010’ (hash
engine). This destination receives the muted version of the data.
23
r
Reserved, set to ‘0’.
22:20
t.o.dest.2 – type of
destination 2
The values for the ‘t.o.dest.2’ field are the same as those listed in “Direction
Instruction” on page 396, except that the crypto engine is not allowed as a destination
(bit [22] must be set to ‘0’). Typically, this field is set to ‘001’ (output).
This destination receives the original, unmuted data.
19
M
Indicates that the mask is appended to the instruction. If M is set to ‘0’, no mask
fields are supplied and the instruction assumes that ‘length’ bytes of zeros are
to be used as mask values. M = ‘0’ could be used with IPv6 extension headers
that contain complete 32-bit words to be muted.
18:17
STAT
Refer to Table D-2, “Instruction Format,” on page 394 for a description of this
field.
16:00
length
The "length" field indicates the number of bytes to be muted, up to 508
bytes. The specified length must be a multiple of four bytes (bits [1:0] = ‘00’).
31:00
mask
The mask data immediately following the MUTE instruction to be used in the
bitwise AND operation with the input packet data. Refer to “data (Optional)” on
page 395.
Note: If the same value is set for both destination fields, the MUTE instruction will send both
muted and unmuted data to the same destination.
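The muting operation itself is a bytewise AND under the length constraints given above. The following software model illustrates it; `mute_bytes()` is a hypothetical name, a NULL mask is used to model M = ‘0’ (a mask of all zeros), and the byte order of the mask words within the token is not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Software model of the MUTE data path. Writes the muted copy of
 * in[0..len-1] to muted[]. Returns 0 on success, or -1 if len
 * exceeds 508 bytes or is not a multiple of four. */
static int mute_bytes(const uint8_t *in, const uint8_t *mask,
                      uint8_t *muted, size_t len)
{
    if (len > 508 || (len & 0x3u) != 0)
        return -1;

    for (size_t i = 0; i < len; i++)
        muted[i] = mask ? (uint8_t)(in[i] & mask[i]) : 0x00;

    return 0;
}
```

In the typical configuration, the muted copy is what the hash engine would see, while the unmodified input is what reaches the output data stream.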
D.5.5 IP Header Instructions
There are three IP header instructions.
• IPv4 (with checksum cleared)
• IPv4_CHECKSUM (checksum updated)
• IPv6
All instructions involve only the first words of the IP header. In the case of IPv4, the first three
words are processed by the IP Header instruction, while in the case of IPv6 only the first two words
are processed.
D.5.5.1 IPv4 Instruction
The IPv4 instruction processes the first three words of the IPv4 header and passes them according to the destination specified in the t.o.dest. field. It inserts the length and protocol fields of
the instruction into the corresponding fields in the IP header. It also clears the IPv4 checksum to
0 and, if the D field of the instruction is set to ‘1’, decrements the IPv4 ‘time to live’ field.
Table D-5 illustrates which IPv4 header fields are inserted, updated, or untouched.
IPv4 (With Checksum Untouched)
31:28 opcode (0111) | 27 D | 26:24 t.o. dest. | 23:16 protocol | 15:00 length
Table D-5. IPv4 (with Checksum Untouched) Definition
Bits
Name
Description
31:28
opcode
Refer to "opcode" for a description of this field.
27
D – ‘time to live’ decrement
The IPv4 ‘time to live’ value is decremented when ‘D’ is set to ‘1’.
26:24
t.o.dest.
Destination for the processed header words. The values for the ‘t.o.dest.’ field are
the same as those listed in “Direction Instruction” on page 396.
23:16
protocol
The value to be placed in the ‘protocol’ field of the IPv4 header.
15:00
length
The value (in little-endian format) to be placed in the ‘length’ field of the IPv4
header. The IPv4 instruction takes care of the byte swapping.
[Figure: the first three 32-bit words of the IPv4 header (version, IHL, type of service, length, identifier, flags, fragment offset, time to live, protocol, checksum), with fields marked as “replaced with the value from the instruction” or “updated”.]
Figure D-17: IPv4 (with Checksum Untouched) Instruction: Datagram Modifications
Refer to Table D-3.
D.5.5.2 IPv4_CHECKSUM Instruction
The IPv4_CHECKSUM instruction processes the first three words of the IP header and passes them
according to the t.o.dest. field. It inserts the length and protocol fields of the instruction
into the corresponding fields in the IP header. It also updates the IPv4 checksum and, if the D
field of the instruction is set to ‘1’, decrements the IPv4 ‘time to live’ field. Figure D-20 shows
which IPv4 header fields are inserted or updated.
IPv4_CHECKSUM
31:28 opcode (0110) | 27 D | 26:24 t.o. dest. | 23:16 protocol | 15:00 length
Table D-6. IPv4_CHECKSUM Definition
Bits     Name                            Description
31:28    opcode                          Refer to "opcode" for a description of this field.
27       D – ‘time to live’ decrement    The IPv4 ‘time to live’ value is decremented when ‘D’ is set to ‘1’.
26:24    t.o.dest.                       Destination for the processed header words. The values for the ‘t.o.dest.’ field are the same as those listed in “Direction Instruction” on page 396.
23:16    protocol                        The value to be placed in the ‘protocol’ field of the IPv4 header.
15:00    length                          The value (in little-endian format) to be placed in the ‘length’ field of the IPv4 header. The IPv4_CHECKSUM instruction handles the byte swapping.
[Figure: IPv4 header fields (version, IHL, type of service, length, identifier, flags, fragment offset, time to live, protocol, checksum), 32 bits wide, marked with the legend entries "replaced with the value from the instruction" and "updated".]
Figure D-18: IPv4_CHECKSUM Instruction: Datagram Modifications
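The net effect of the IPv4_CHECKSUM instruction on a header can be modelled in software. The following is a minimal sketch, not a description of the hardware datapath: the function names are hypothetical, the header is handled in network byte order, and the checksum is recomputed over the whole header (the standard RFC 791 one's-complement sum) rather than updated incrementally.

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """Standard IPv4 header checksum: one's-complement sum of 16-bit words."""
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def apply_ipv4_checksum(header: bytes, length: int, protocol: int,
                        decrement_ttl: bool) -> bytes:
    """Software model (assumption) of the header changes described above."""
    hdr = bytearray(header)
    struct.pack_into("!H", hdr, 2, length)   # 'length' field, header bytes 2-3
    if decrement_ttl:                        # models D = 1
        hdr[8] = (hdr[8] - 1) & 0xFF         # 'time to live', header byte 8
    hdr[9] = protocol                        # 'protocol', header byte 9
    hdr[10:12] = b"\x00\x00"                 # clear checksum before summing
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return bytes(hdr)
```

A header processed this way verifies correctly: running `ipv4_checksum` over the returned header, checksum field included, yields 0.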
D.5.5.3 IPv6 Instruction
The IPv6 instruction processes the first two words of the IPv6 header and passes them to the destination specified by the t.o.dest. field. It inserts the instruction’s payload length and next header fields into the corresponding fields of the IPv6 header. The D field determines whether the IPv6 ‘hop limit’ value is decremented.
Figure D-19 illustrates which fields of the IPv6 header are removed, inserted, or updated.
Appendix D: Inline Packet Engine
IPv6
Bits:    31:28     27    26:24         23:16          15:00
Field:   opcode    D     t.o. dest.    next header    payload length
Value:   1000      –     –             –              –
Table D-7. IPv6 Definition
Bits     Name                         Description
31:28    opcode                       Refer to "opcode" for a description of this field.
27       D – ‘hop limit’ decrement    The IPv6 ‘hop limit’ value is decremented when ‘D’ is set to ‘1’.
26:24    t.o.dest.                    Destination for the processed header words. The values for the ‘t.o.dest.’ field are the same as those listed in “Direction Instruction” on page 396.
23:16    next header                  The value to be placed in the ‘next header’ field of the IPv6 header.
15:00    payload length               The value (in little-endian format) to be placed in the ‘payload length’ field of the IPv6 header. The IPv6 instruction takes care of the byte swapping.
[Figure: IPv6 header fields (version, traffic class, flow label, payload length, next header, hop limit), 32 bits wide. Payload length and next header are marked "replaced with the value from the instruction"; hop limit is marked "optionally decremented".]
Figure D-19: IPv6 Instruction: Datagram Modifications
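The three header instructions share one word layout, so a token builder can assemble them with a single helper. The sketch below is illustrative only: the function and dictionary names are hypothetical, the opcode values are read from the bit diagrams above, and `to_dest` is treated as an opaque 3-bit code whose meaning is defined in the “Direction Instruction” section. The 16-bit length value is placed in the word as given; any byte ordering of that value is left to the caller.

```python
# Opcode values as read from the instruction diagrams (bits 31:28).
OPCODES = {"IPV4": 0b0111, "IPV4_CHECKSUM": 0b0110, "IPV6": 0b1000}


def encode_header_instruction(kind: str, decrement: bool, to_dest: int,
                              proto_or_next_header: int, length: int) -> int:
    """Pack an IPv4 / IPv4_CHECKSUM / IPv6 instruction into a 32-bit word."""
    if not 0 <= to_dest <= 0b111:
        raise ValueError("t.o.dest. is a 3-bit field")
    if not 0 <= proto_or_next_header <= 0xFF:
        raise ValueError("protocol / next header is an 8-bit field")
    if not 0 <= length <= 0xFFFF:
        raise ValueError("length / payload length is a 16-bit field")
    return ((OPCODES[kind] << 28)            # bits 31:28
            | (int(bool(decrement)) << 27)   # bit 27, D
            | (to_dest << 24)                # bits 26:24
            | (proto_or_next_header << 16)   # bits 23:16
            | length)                        # bits 15:00


word = encode_header_instruction("IPV6", True, 0b001, 59, 0x0028)
print(f"{word:#010x}")
```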
D.5.6 Post-Process Instructions
Post-processing instructions are used to modify packet data after the packet has been processed
by the packet processing module. Typically this includes updating packet headers, for example,
inserting the ICV in the AH header or adding/removing data from the packet. Post-processing
instructions are executed by the post-processing module.
There are two post-process instructions, the INSERT_REMOVE_RESULT instruction and the
REPLACE_BYTE instruction.
The REPLACE_BYTE instruction can be used to replace a single byte in the output data stream
with a value from the instruction.
Note that both instructions have limited functionality in the Crypto Packet Processor configurations because of the minimally sized output buffer. In those configurations the two instructions can only append result data, whereas in the standard Crypto Packet Processor they can also update fields in the packet.
D.5.6.1 INSERT_REMOVE_RESULT (IRR) Instruction
The INSERT_REMOVE_RESULT instruction (also referred to as the IRR instruction) can be used as
an INSERT or REMOVE operation. It can take result data from the hash result registers or the status
registers and INSERT (actually replace) this data in the appropriate fields of the output data
stream or append it to the end of the data stream. This same instruction can also be used to
REMOVE a hash compare value from the decrypted data stream starting from any byte.
The IRR instruction functions as a remove_result operation when all of the following fields are set
to ‘0’: L, NH, CS and P. Otherwise, this instruction functions as an insert_result operation for
updating the length, next header, and checksum fields in IPv4 or IPv6 headers (refer to
“INSERT_REMOVE_RESULT Instruction (remove_result Operation)” on page 411).
The IRR instruction is typically used to update certain fields in the output packet with the new
values. The common “use case” is the insertion of the ICV when performing AH outbound operations or updating the IP header fields when performing ESP inbound (transport mode)
operations.
However, in the Crypto Packet Processor configurations, the IRR instruction is not able to perform updates of the header data in the output buffer. This logic is disabled for the Crypto Packet
Processor configurations because of the minimum sized data output buffer. Instead, the input
token must use a special mode of this instruction to append the data for update to the end of the
packet. A module external to the Crypto Packet Processor must inspect the output token, which
contains information about the data that is appended. This module must be aware of the protocol
‘use case’ and must replace the required fields using information from the appended data. These
actions are similar to those for “jumbo”-frames in the standard Crypto Packet Processor.
Please refer to “Inline Packet Engine — Token Examples” on page 491 for the specific AH outbound and ESP inbound transport mode tokens that can be used within the Crypto Packet
Processor configurations.
Using the IRR instruction to remove an encrypted block of zeroes from the output buffer (used in
AES-GCM/GMAC/CCM modes) is not affected in the Crypto Packet Processor and remains the
same.
Ordering and alignment of data appended to the output data stream: When the IRR instruction is used to append updated IP header fields to the end of the output data stream, the field order and byte alignment are fixed. When more than one of the L, NH, and CS fields is to be appended to the data stream, the selected fields are always appended in the following order: the length field, then the next header field, and finally the checksum field. These fields are always appended as 32-bit dwords, with byte alignment within the dword as follows for IPv4:
Alignment of Appended IP Header Fields (IPv4)
                                  MSB                                                                    LSB
1st appended dword if selected    16’h0000    ‘total length’ (IPv4), dword bits [15:0]
2nd appended dword if selected    16’h0000    ‘protocol’ (IPv4), dword bits [15:8]               8’h00
3rd appended dword if selected    16’h0000    ‘checksum field’ (IPv4 only), dword bits [15:0]
For all other cases (IPv6) the alignment is as follows:
Alignment of Appended IP Header Fields (IPv6)
                                  MSB                                                                    LSB
1st appended dword if selected    16’h0000    ‘payload length’ (IPv6), dword bits [15:0]
2nd appended dword if selected    16’h0000    8’h00    ‘next header’ (IPv6), dword bits [7:0]
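An external module that consumes the appended data needs to know which dwords to expect and where each field sits. The following sketch models the fixed order and alignment described above. The function name and arguments are hypothetical, fields are selected by passing a value instead of `None`, and the 16-bit values are placed as given (byte order inside the 16-bit field is not modelled here).

```python
def appended_dwords(ip_version, length=None, next_header=None, checksum=None):
    """Return the appended 32-bit dwords in the fixed order L, NH, CS."""
    dwords = []
    if length is not None:
        dwords.append(length & 0xFFFF)        # dword bits [15:0]
    if next_header is not None:
        nh = next_header & 0xFF
        # IPv4 'protocol' sits in bits [15:8]; IPv6 'next header' in bits [7:0]
        dwords.append(nh << 8 if ip_version == 4 else nh)
    if checksum is not None and ip_version == 4:
        dwords.append(checksum & 0xFFFF)      # checksum dword is IPv4 only
    return dwords


# IPv4 with all three fields selected.
print([f"{d:#010x}" for d in appended_dwords(4, 0x1234, 0x06, 0xABCD)])
```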
INSERT_REMOVE_RESULT Instruction (remove_result Operation)
This section describes the INSERT_REMOVE_RESULT instruction functioning as a remove_result operation.
The remove_result operation is specifically used for the removal of the hash result or an
encrypted value (see description of bit 17 of the Context Control Word1 below) from the output
data stream, allowing for the subsequent use of the removed value by the VERIFY_FIELDS
instruction. This differs from the RETRIEVE instruction, which can only extract data from the
input data stream.
Note that the postprocessing module can only buffer one instruction at a time; execution of the
current post-process instructions must be completed before the next instruction can be passed to
the postprocessing module. Therefore, the order of these instructions within the token is important, specifically in case of the remove_result operation in combination with additional
postprocessing instructions.
The remove_result operation needs to wait in the postprocessing module for the actual data to
be removed before passing through this module. The control module continues processing
instructions from the token, but if an additional postprocessing instruction is encountered, further instruction processing in the control module is halted until the postprocessing module has
successfully finished with its current remove_result operation. Only then can the postprocessor module accept the new postprocessing instruction, allowing the control module to continue
processing.
When constructing tokens, also be aware of the following. The remove_result operation contains an offset to a location in the packet data stream. If the remove_result operation is provided to the postprocessing module after this offset location in the data stream has already passed through the postprocessing module (for example, because other instructions blocked further instruction processing), this outdated remove_result operation will never trigger. This prevents the Crypto Packet Processor from completing the operation on the current token, so it stops processing packets until the timeout counter eventually triggers and a result token with error code E14 (refer to “Result Token Definition” on page 429) is returned. Note that for this error to occur, the timeout counter must be activated.
The remove_result operation is context sensitive: its behavior changes depending on the context of the packet currently being processed. The context bit in question is the encrypt hash result bit, bit [17] of context control word 1 from the packet context. The reason for this different behavior is explained as follows:
Context Control Word1, bit 17, Encrypt Hash Result = ‘1’
If this bit is set to 1, the remove_result operation removes an encrypted value from the data stream. In this case, the operation is specifically meant for use with the GHASH module (used in the AES-GCM/GMAC algorithms) or the XCBC module (used in the AES-CCM algorithm). Note that the encrypt hash result bit may only be used in combination with one of these modes. Both modes require the result of the hash operation to be XOR-ed with an encrypted value to generate the actual ICV (integrity check value).
Therefore, in the case of GHASH (AES-GCM) or XCBC (AES-CCM) with the encrypt_hash_result bit set to ‘1’, the following instructions are used to generate the ICV.
1. A remove_result instruction is used to schedule removal of the first block (16 bytes) of the encrypted input data and to store it in internal registers, while the rest of the data stream is processed and the final hash result is calculated. The instruction is executed after all packet data has passed through the crypto engines.
Note 1: The first encrypted block cannot be used for encrypting data since doing so would open up the possibility of an attack on the nonce. Instead, this block is used to encrypt the final hash result.
Note 2: The length field must be set to 16 (AES block size) for this mode to operate correctly.
2. The following INSERT instruction is used to pass the first encrypted block to the postprocessor. This operation inserts 16 bytes of zeros (AES block size) and performs an XOR with the first encrypted block in the data stream. This results in passing the first encrypted block directly to the Crypto Packet Processor postprocessor, for later encrypting (XOR-ing with) the hash result.
3. Once the final hash result is computed, it is encrypted (XOR-ed with the internally stored encrypted block of zeroes). This XOR operation is enabled by the encrypt hash result bit.
4. The result digest can be used by subsequent VERIFY or INSERT instructions as usual.
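The arithmetic behind steps 2 and 3 is a plain byte-wise XOR, which the following sketch illustrates. It is a software illustration only: the function names are hypothetical and the block contents are made-up test values, not output of a real cipher.

```python
AES_BLOCK = 16  # bytes


def xor_block(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt_hash_result(hash_result: bytes, encrypted_block: bytes) -> bytes:
    """Step 3: XOR the final hash result with the stored encrypted block."""
    if len(encrypted_block) != AES_BLOCK:
        raise ValueError("the stored block must be one AES block (16 bytes)")
    return xor_block(hash_result, encrypted_block)


# Step 2: XOR-ing 16 inserted zero bytes with the first encrypted block
# passes that block through unchanged.
first_block = bytes(range(16, 32))           # illustrative value only
assert xor_block(bytes(AES_BLOCK), first_block) == first_block

# Step 3: the ICV is the hash result XOR-ed with that block.
icv = encrypt_hash_result(bytes(range(16)), first_block)
print(icv.hex())
```

Because XOR is its own inverse, applying the same block to the ICV again recovers the original hash result.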
Context Control Word1, bit 17, Encrypt Hash Result = ‘0’
In this case the remove_result operation is used to extract the hash result from the inbound
(to be decrypted) data stream, after decryption by the packet engine. This is only relevant for protocols that specify the hash result to be encrypted after the plaintext payload data, such as SSL
or TLS.
Note: SSL and TLS require hash-then-encrypt for packet processing, automatically encrypting the
hash result, unlike encrypt-then-hash operation for the IPSec outbound ESP transform.
IRR Instruction (remove_result Operation)
Bits:    31:28     27    26    25    24:19     18:17    16    15:00
Field:   opcode    L     NH    CS    length    STAT     P     offset in the output data stream
Value:   1010      0     0     0     –         –        0     –
Table D-1. IRR Instruction remove_result Operation Definition
Bits          Name                            Description
31:28         opcode                          Refer to "opcode" for a description of this field.
27, 26, 25    L, NH, CS                       The REMOVE_RESULT instruction requires the ‘L’, ‘NH’, and ‘CS’ fields to be set to ‘0’.
24:19         length                          The ‘length’ field indicates the number of bytes to be removed; this can be 12, 16, 20, or 32 bytes. If the length field is 0 (6’b00_0000), 64 bytes are removed.
18:17         STAT                            Refer to Table D-2, “Instruction Format,” on page 394 for a description of this field.
16            P                               The REMOVE_RESULT instruction requires the ‘P’ field to be set to ‘0’.
15:00         offset in output data stream    Removal starts at the location specified by the offset value in this field.
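The length encoding above, in which a field value of 0 stands for 64 bytes, is easy to get wrong when building tokens by hand. The following sketch packs a remove_result word under these rules. The function name is hypothetical, the opcode value 1010 is read from the bit diagram, and STAT is passed through as an opaque 2-bit value.

```python
IRR_OPCODE = 0b1010  # bits 31:28, as read from the diagram above


def encode_remove_result(num_bytes: int, offset: int, stat: int = 0) -> int:
    """Pack an IRR word for the remove_result operation (L = NH = CS = P = 0)."""
    if num_bytes not in (12, 16, 20, 32, 64):
        raise ValueError("removable sizes are 12, 16, 20, 32, or 64 bytes")
    if not 0 <= offset <= 0xFFFF:
        raise ValueError("offset is a 16-bit field")
    if not 0 <= stat <= 0b11:
        raise ValueError("STAT is a 2-bit field")
    length_field = 0 if num_bytes == 64 else num_bytes  # 0 encodes 64 bytes
    return ((IRR_OPCODE << 28)
            | (length_field << 19)   # bits 24:19
            | (stat << 17)           # bits 18:17
            | offset)                # bits 15:00


print(f"{encode_remove_result(16, 0x0040):#010x}")
```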
INSERT_REMOVE_RESULT Instruction (insert_result Operation)
This section describes the INSERT_REMOVE_RESULT (IRR) instruction functioning as an insert_result operation. In the Crypto Packet Processor configurations, this instruction can only be used to append data to the packet.
The insert_result operation is typically used to update:
1. length, next header, and checksum fields in IPv4 or IPv6 headers in the output data stream, when processing an inbound transport mode ESP packet
2. hash result field in AH headers in the output data stream, when processing an outbound mode AH packet
For inbound tunnel mode packets, header updates are not required as the full header is already present inside the ESP payload (see the list of RFCs in “References” on page 577 for details on AH/ESP tunnel/transport modes).
Note: The name of this instruction can be somewhat misleading, as the instruction does not add data to the output data stream but replaces existing data. It should not be confused with the REPLACE instruction, which operates on packet data in the input data stream before the data is submitted to the packet processing module, or with the INSERT instruction, which actually does add data to the input data stream. The insert_result operation strictly operates on the output data stream, after packet processing.
IRR Instruction (insert_result Operation) General Format
Bits:    31:28     27    26    25    24:19     18:17    16    15:00
Field:   opcode    L     NH    CS    length    STAT     P     offset in output stream
Value:   1010      –     –     –     –         –        1     –
Table D-1. IRR Instruction General Format Definition
Bits     Name                       Description
31:28    opcode                     Refer to "opcode" for a description of this field.
27       L                          If set to ‘1’, indicates that the packet header ‘length’ field needs to be updated to match the actual, processed, packet.
26       NH                         If set to ‘1’, indicates that the IPv4/IPv6 packet header ‘Protocol’/‘Next Header’ field needs to be updated to match the actual, processed, packet.
25       CS                         If set to ‘1’, indicates that the packet header Checksum field needs to be updated to match the actual, processed, packet.
24:19    length                     Indicates the hash result length or the relative location of the next header field or the relative location of the checksum field, if no NH is available. A length of 0 (6’b00_0000) is not allowed when indicating a relative location for next header or checksum.
18:17    STAT                       See Table D-2, “Instruction Format,” on page 394 for a description of this field.
16       P                          The insert_result operation requires this field to be set to ‘1’.
15:00    offset in output stream    Indicates the location of the hash result in the packet or the location of the length field (byte pointer). If bits [15:1] are all set to ‘1’ the data is always appended to the end of the packet, rather than inserted (see Note 2).
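When inspecting tokens during debugging, it can help to split an IRR word back into its fields and report which operation it selects. The sketch below is a hypothetical helper, not part of any Tilera software: it uses the bit positions from the table above, treats a word with L, NH, CS, and P all zero as remove_result, and treats an offset whose bits [15:1] are all ones as the append form.

```python
from typing import NamedTuple


class IrrFields(NamedTuple):
    opcode: int
    l: int
    nh: int
    cs: int
    length: int
    stat: int
    p: int
    offset: int


def decode_irr(word: int) -> IrrFields:
    """Split a 32-bit IRR instruction word into its fields."""
    return IrrFields(
        opcode=(word >> 28) & 0xF,
        l=(word >> 27) & 1,
        nh=(word >> 26) & 1,
        cs=(word >> 25) & 1,
        length=(word >> 19) & 0x3F,
        stat=(word >> 17) & 0x3,
        p=(word >> 16) & 1,
        offset=word & 0xFFFF,
    )


def irr_operation(f: IrrFields) -> str:
    """Classify the operation selected by the decoded fields."""
    if not (f.l or f.nh or f.cs or f.p):
        return "remove_result"
    if (f.offset >> 1) == 0x7FFF:            # bits [15:1] all set to 1
        return "insert_result (append)"
    return "insert_result (in place)"


fields = decode_irr(0xA0630018)
print(fields, irr_operation(fields))
```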
Note 1: The insert_result operation should only be used for updating data in (as opposed to after) the output data stream. If data must be appended due to output buffer overflow, or if instructed by the ToO field in the token command word, then this is done aligned to a 32-bit boundary. Also, this is done after the packet and the applicable field in the result token are set, even if the packet data did not end on a 32-bit boundary.
Appending (as opposed to updating) hash result data after the data stream (possibly misaligned), according to the applicable protocol, should be done using an INSERT or REPLACE instruction (for example, IPSec ESP) with the t.o.dest. field set to output only.
Note 2: These instructions, with the exception of the remove_result operation, must always be located after the instruction Types 1 and 2.
Examples of the INSERT_REMOVE_RESULT Instruction (insert_result Operations)
This section provides examples of the INSERT_REMOVE_RESULT (IRR) instruction used to perform the following specific insert_result operations.
1. Insert hash result.
2. Insert length, next header, and checksum.
3. Insert modified length and next header with checksum modification.
4. Insert modified length and next header without checksum modification.
5. Insert next header and checksum.
6. Insert modified length and checksum.
7. Insert modified length with or without checksum modification.
8. Insert next header with or without checksum modification.
9. Insert modified length.
10. Insert checksum.
Where:
length = IPv4 total packet length
modified length = length value modified to equal IPv6 payload length
checksum = update IPv4 packet header checksum field
checksum modification = update internal checksum register only
IRR Instruction Example (Insert Hash Result Operation)
This operation example inserts the hash result (if available) at the location indicated by the offset in output stream field. The presence of this instruction in the token requires that a previous instruction has the hash done bit of the STAT field set, that is, STAT[0] (bit [17] of a preprocessing instruction). Otherwise, an error is generated since the hash operation was not completed. Note that if no hash operation was specified for the current packet, this operation inserts zeros in the data stream.
This insert_result operation example is typically used to insert an AH header ICV field in the case of outbound mode.
IRR Instruction Example (Insert Hash Result Operation)
Bits:    31:28     27    26    25    24:19     18         17         16    15:00
Field:   opcode    L     NH    CS    length    STAT[1]    STAT[0]    P     offset in output stream
Value:   1010      0     0     0     –         –          1          1     –
Table D-1. IRR Instruction Example Definition
Bits          Name                       Description
31:28         opcode                     Refer to "opcode" for a description of this field.
27, 26, 25    L, NH, CS                  This operation requires the ‘L’, ‘NH’, and ‘CS’ fields to be set to ‘0’.
24:19         length                     The ‘length’ field indicates the length of the inserted hash result. Note: When updating (inserting or appending) a hash result, a length of 0 (6’b00_0000) results in an update of 64 bytes.
18:17         STAT[0]                    This bit must be set to ‘1’. Since this operation does not modify the packet header, it does not update the internally calculated header checksum. Therefore, STAT[0] must be set to ‘1’, indicating that no checksum update is required as a result of this operation.
16            P                          This operation requires this field to be set to ‘1’.
15:00         offset in output stream    The location in the output data stream to insert the hash result. If all bits in this field are set to 1, the hash result is appended to the packet. This appending is 32-bit aligned, regardless of the byte alignment at the end of the packet.
Note: Figure D-20 applies to packets that are less than 1792 bytes in length. For larger packets, the
12-byte ICV field is appended to the packet by default, in which case Figure D-21 applies.
[Figure: byte alignment within 32-bit words (bit 31 to bit 0, bytes B3 to B0 per word, ICV bytes B0 through B11) for a packet consisting of IP header, AH header, and packet data. The hash result (ICV) calculated by the Crypto Packet Processor is inserted in the AH header ICV field, starting with byte 0.]
Figure D-20: IPv4 Inserted Field - insert_hash_result Operation - AH Header ICV Field
[Figure: byte alignment within 32-bit words (bit 31 to bit 0, bytes B3 to B0 per word, ICV bytes B0 through B11) for a packet consisting of IPv4 header, AH header, and packet data, followed by appended data. The hash result (ICV) calculated by the Crypto Packet Processor is appended after the packet data, starting with byte 0.]
Figure D-21: IPv4 Appended Field - insert_hash_result Operation - AH Header ICV Field
IRR Instruction Examples (Modifying IP Header Using insert_result Operations)
The insert_result operation can also be used to modify IP header fields: L, NH, and CS. The
following subsections present IRR instruction examples that modify the IP header using
insert_result operations. Since these example operations always modify the IP header, updating the internal checksum is always required, and is enabled by setting the STAT[0] bit to ‘0’ in
the instruction.
Insert_Result Operation Example (Insert Length, Next Header and Checksum)
This insert_result operation example is typically used for normal IPv4 header updates, in the
case of inbound ESP transport mode. In this example, each of the L, NH, and CS fields is enabled,
set to ‘1’. The operation assumes that the next header and checksum fields are in adjacent
fields in the IPv4 header. See Figure D-22. The checksum field immediately follows the protocol field, which is updated with the next header value.
This instruction inserts the actual packet length (calculated by the Crypto Packet Processor after
processing) at the location indicated by the offset in output stream field. The length field
indicates the positive offset relative to offset in output stream field where the next
header value is to be inserted (IPv4 protocol field); the checksum is inserted immediately
next to this location. The next header value used for the insertion is retrieved from the padding. See Figure D-24.
The inserted checksum is the result of the addition of the current checksum value (calculated so
far by the Crypto Packet Processor) plus length and next header, where next header is
added to the 8 MSbs of checksum. In the IPv4 header, the location of the checksum field (immediately adjacent to protocol field) and the way the protocol field must be added to
checksum, is fixed and corresponds to the order of these fields.
In IPv6, this operation updates only the payload length and next header fields. The payload length is inserted at the location pointed to by the offset in output stream field. The length field must reflect the actual size of the IPv6 packet header for this packet. The contents of the length field are subtracted from the packet length (in bytes) calculated internally by the Crypto Packet Processor to obtain the actual payload length to be inserted in the header. See Figure D-27.
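The IPv6 payload length computation is a single subtraction, sketched below. The function name is hypothetical; the default header size of 0x28 (40 bytes, the fixed IPv6 header) matches the ‘length’ value used in the IPv6 figures of this section, and the 6-bit width of the length field limits the header size that can be expressed to 63 bytes.

```python
def ipv6_payload_length(result_packet_length: int,
                        header_length_field: int = 0x28) -> int:
    """Payload length = internally calculated packet length - 'length' field."""
    if not 0 < header_length_field <= 0x3F:
        raise ValueError("'length' is a non-zero 6-bit field here")
    if header_length_field > result_packet_length:
        raise ValueError("packet is shorter than the stated header size")
    return result_packet_length - header_length_field


# A 1240-byte result packet with a 40-byte IPv6 header.
print(ipv6_payload_length(1240))
```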
In IPv4 and IPv6, if all bits in the offset in output stream field are set to 1, the applicable
updated fields are appended to the packet. This appending is 32-bit aligned, regardless of the byte
alignment at the end of the packet.
IRR Instruction Example (Insert Length, Next Header, and Checksum)
In this example the inserted length is the total length. This is the result packet length field in the result token. It must equal the required value in the IPv4 total length field; otherwise the wrong length will be inserted. Therefore, when inserting data in front of the IP header, make sure that appropriate postprocessing of the result packet is done to correct the length field (and checksum if needed) in the IP header.
Bits:    31:28     27    26    25    24:19     18         17         16    15:00
Field:   opcode    L     NH    CS    length    STAT[1]    STAT[0]    P     offset in output stream
Value:   1010      1     1     1     –         –          0          1     –
Table D-2. IRR Instruction Example Definition
Bits          Name                       Description
31:28         opcode                     Refer to "opcode" for a description of this field.
27, 26, 25    L, NH, CS                  This operation requires the ‘L’, ‘NH’, and ‘CS’ fields to be set to ‘1’.
24:19         length                     The ‘length’ field indicates the positive offset relative to the ‘offset in output stream’ field where the next header value is to be inserted; the checksum is inserted immediately next to this location.
18:17         STAT[0]                    This bit must be set to ‘0’, indicating that a checksum update is required as a result of this operation.
16            P                          This operation requires this field to be set to ‘1’.
15:00         offset in output stream    The location in the output data stream where the actual packet length calculated by the Crypto Packet Processor is inserted after packet processing. If all bits in this field are set to 1, the appropriate fields are appended to the packet. See Figure D-22 (IPv4) and Figure D-23 (IPv6).
[Figure: IPv4 header datagram, 32 bits wide with 4-bit subfields: version, IHL, type of service, total packet length, time to live, protocol, header checksum, source address, destination address, option, payload.]
Figure D-22: IPv4 Header Datagram
Bits:     31:24                    23:16                       15:08                    07:04      03:00
Word 0    Length byte 1 (LSB)      Length byte 0 (MSB)         type of service          version    IHL

Bits:     31:24                    23:16                       15:08                    07:00
Word 1    frag offset part 1       flags, frag off. part 0     identifier byte 1        identifier byte 0
Word 2    checksum byte 1          checksum byte 0             protocol                 time to live
Word 3    source address byte 3    source address byte 2       source address byte 1    source address byte 0
Word 4    dest. address byte 3     dest. address byte 2        dest. address byte 1     dest. address byte 0
[Figure: IPv6 header datagram, 32 bits wide with 4-bit subfields: version, traffic class, total packet length, next header, hop limit, source address, destination address, payload.]
Figure D-23: IPv6 Header Datagram
Bits:     31:28                  27:24                  23:20                  19:16                  15:12                     11:08                  07:04      03:00
Word 0    flow label nibble 3    flow label nibble 4    flow label nibble 1    flow label nibble 2    traffic class nibble 1    flow label nibble 0    version    traffic class nibble 0

Bits:     31:24                     23:16                     15:08                          07:00
Word 1    hop limit                 next header               payload length byte 1 (LSB)    payload length byte 0 (MSB)
Word 2    source address byte 3     source address byte 2     source address byte 1          source address byte 0
Word 3    source address byte 7     source address byte 6     source address byte 5          source address byte 4
Word 4    source address byte 11    source address byte 10    source address byte 9          source address byte 8
Word 5    source address byte 15    source address byte 14    source address byte 13         source address byte 12
Word 6    dest. address byte 3      dest. address byte 2      dest. address byte 1           dest. address byte 0
Word 7    dest. address byte 7      dest. address byte 6      dest. address byte 5           dest. address byte 4
Word 8    dest. address byte 11     dest. address byte 10     dest. address byte 9           dest. address byte 8
Word 9    dest. address byte 15     dest. address byte 14     dest. address byte 13          dest. address byte 12
[Figure: byte alignment within 32-bit words (bit 31 to bit 0, bytes B3 to B0) for an IPv4 header followed by packet data, highlighting the updated total length, protocol (P), and checksum fields. The ‘protocol’ field is updated with the instruction ‘next header’ value. Start value = instruction ‘offset in output stream’ field = 16’h0002. Start value = instruction ‘length’ field = 6’h07.]
Figure D-24: IPv4 Updated Fields - insert_result Operation - ‘L’, ‘NH’, ‘CS’ and ‘P’ Fields Selected
Note: Figure D-24 applies to packets that are less than 1792 bytes in length. For larger packets,
updated fields are appended to the packet by default, in which case Figure D-25 applies.
[Figure: byte alignment within 32-bit words (bit 31 to bit 0, bytes B3 to B0) for an IPv4 header and packet data followed by appended data. The appended words carry the total length (actual packet length calculated by the Crypto Packet Processor), the protocol (P), and the checksum (updated checksum value calculated by the Crypto Packet Processor), each padded with zeros (16’b00, 8’b0) to 32 bits.]
Figure D-25: IPv4 Appended Updates - insert_result Operation - ‘L’, ‘NH’ and ‘CS’ Fields Selected
Note: Figure D-25 applies to packets that are equal to or larger than 1792 bytes in length. Figure D-25 also applies to packets processed with an insert_result operation with the instruction offset in output stream field = 16’hFFFE. Appended updated data always starts on a 32-bit aligned position. See Table D-4.
Bits:     31:24       23:16       15:08                  07:00
Word 0    reserved    reserved    Length byte 1 (LSB)    Length byte 0 (MSB)
Word 1    reserved    reserved    protocol               reserved
Word 2    reserved    reserved    checksum byte 1        checksum byte 0
[Figure: byte alignment within 32-bit words (bit 31 to bit 0, bytes B3 to B0) for an IPv6 header followed by packet data, highlighting the updated payload length and next header (NH) fields. Start value = instruction ‘offset in output stream’ field = 16’h0004.]
Figure D-26: IPv6 Updated Fields - insert_result Operation - ‘L’, ‘NH’, ‘P’ Fields Selected and ‘length’ = 0x28
Note: Figure D-26 applies to packets that are less than 1792 bytes in length. For larger packets,
updated fields are appended to the packet by default, in which case Figure D-27 applies.
[Figure: byte alignment within 32-bit words (bit 31 to bit 0, bytes B3 to B0) for an IPv6 header and packet data followed by appended data. The appended words carry the payload length and the next header (NH), each padded with zeros (16’b00) to 32 bits. ‘offset in output stream’ field = 16’hFFFE = append to packet on a 32-bit boundary.]
Figure D-27: IPv6 Appended Updates - insert_result Operation - ‘L’, ‘NH’, ‘P’ Fields Selected and ‘length’ = 0x28
Bits:     31:24       23:16       15:08                          07:00
Word 0    reserved    reserved    payload length byte 1 (LSB)    payload length byte 0 (MSB)
Word 1    reserved    reserved    reserved                       Next header
The following insert_result operation examples are similar to this example with one or more options disabled. The diagrams above are applicable to the following sections.
If multiple insert_result operations are used and data is appended to a packet, the order of
the appended data will be in the order of the operations in the token.
Insert_Result Operation Example (Insert Modified Length and Next Header with Checksum Modification)
There is currently no actual “use case” for this operation.
This operation updates the payload length field of an IP packet header with the length of the result packet. The payload length is inserted at the location pointed to by the offset in output stream field.
This operation updates the next header field of an IP packet header. The next header value is retrieved from the padding and is inserted at an offset of length bytes from the payload length field.
This operation updates the internal checksum value but does not insert the checksum immediately next to the protocol field, as is appropriate for an IPv4 header. To insert the checksum later, a separate insert_result operation must be used. Refer to “Context Control Instructions” on page 425, Insert_Result Operation Example (Insert Checksum).
Example:
To update the length and next header fields in the IPv4 header located in front of the packet,
the offset in output stream field value must be 16h’0002 and length field value must be
6h’07 as shown in Figure D-24.
Insert Modified Length and Next Header with Checksum Modification
[Instruction format, bits 31 to 00: opcode = 1010, L = 1, NH = 1, CS = 0, followed by the length, STAT, P, and offset in output stream fields.]
Insert_Result Operation Example (Insert Modified Length and Next Header w/o Checksum Modification)
This insert_result operation example is typically used to update an IPv6 header in the case of
inbound ESP transport mode.
This operation updates the payload length and next header fields of an IPv6 packet header.
The payload length is inserted at the location pointed to by the offset in output stream
field. The length field must reflect the actual size of the IPv6 packet header for this packet. The
contents of the length field are subtracted from the packet length (in bytes) calculated internally by the Crypto Packet Processor to obtain the actual payload length value to be inserted in the
header. The next header value is retrieved from the padding and is inserted immediately after
the inserted payload length.
The internally calculated checksum is not updated, as indicated by the STAT[0] bit set to 1.
Insert Modified Length and Next Header w/o Checksum Modification
[Instruction format, bits 31 to 00: opcode = 1010, L = 1, NH = 1, CS = 0, followed by the length, STAT (STAT[0] = 1), P, and offset in output stream fields.]
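The payload length calculation described for this operation can be illustrated with a short Python sketch. The function name and the default of 0x28 (the 40-byte IPv6 header length used in Figure D-26) are illustrative choices, not part of the Crypto Packet Processor interface.

```python
IPV6_HEADER_BYTES = 0x28  # 'length' value used in Figure D-26 (40-byte IPv6 header)


def ipv6_payload_length(internal_packet_length: int,
                        length_field: int = IPV6_HEADER_BYTES) -> int:
    """Payload length = internally calculated packet length - 'length' field.

    Both arguments are byte counts. The names are hypothetical; only the
    subtraction itself comes from the text.
    """
    if length_field < 0 or internal_packet_length < length_field:
        raise ValueError("packet is shorter than the header length")
    return internal_packet_length - length_field


# A 1240-byte result packet with a 40-byte IPv6 header carries 1200 payload bytes.
print(ipv6_payload_length(1240))
```

The value returned is what would be written into the IPv6 payload length field at the location given by the offset in output stream field.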
Insert_Result Operation Example (Insert Next Header and Checksum)
There is currently no actual “use case” for this operation.
This operation inserts the next header and checksum fields. The next header is inserted at
the location pointed to by the offset in output stream field. The length field of the instruction indicates the positive offset to the checksum in the data stream. The next header value is
retrieved from the padding.
This operation updates the internally calculated checksum before inserting the checksum in the
data stream.
Insert Next Header and Checksum
[Instruction format, bits 31 to 00: opcode = 1010, L = 0, NH = 1, CS = 1, followed by the length, STAT, P, and offset in output stream fields.]
Insert_Result Operation Example (Insert Modified Length and Checksum)
There is currently no actual “use case” for this operation.
This operation updates the payload length of an IPv6 packet header. The payload length is
inserted at the location pointed to by the offset in output stream field. The length field
must reflect the actual size of the IPv6 packet header for this packet. The contents of the length
field are subtracted from the packet length (in bytes) calculated internally by the Crypto Packet Processor
to obtain the actual payload length value to be inserted in the header.
The inserted checksum is the sum of the current checksum value and the inserted
modified length.
Insert Modified Length and Checksum
[Instruction format, bits 31 to 00: opcode = 1010, L = 1, NH = 0, CS = 1, followed by the length, STAT, P, and offset in output stream fields.]
Insert_Result Operation Example (Insert Modified Length w/ or w/o Checksum Modification)
This operation updates the payload length of an IPv6 packet header. The payload length is
inserted at the location pointed to by the offset in output stream field. The length field
must reflect the actual size of the IPv6 packet header for this packet. The contents of the length
field are subtracted from the packet length (in bytes) calculated internally by the Crypto Packet Processor
to obtain the actual payload length value to be inserted in the header.
If P is set to 0, the modification is added to the postprocessing checksum; otherwise (P = 1), the modification is added to the preprocessing checksum and the postprocessing checksum is
overwritten with the result.
The STAT[0] field indicates whether the internally calculated checksum is updated. If STAT[0] = ‘0’,
the internal checksum register is updated. If STAT[0] = ‘1’, the internal checksum register is
not updated.
Insert Modified Length w/ or w/o Checksum Modification
[Instruction format, bits 31 to 00: opcode = 1010, L = 1, NH = 0, CS = 0, followed by the length, STAT, P, and offset in output stream fields.]
Insert_Result Operation Example (Insert Next Header w/ or w/o Checksum Modification)
This operation inserts a next header value at the location pointed to by the offset in output
stream field. The length field of this instruction can be 0.
If P is set to 0, the modification is added to the postprocessing checksum; otherwise (P = 1), the
modification is added to the preprocessing checksum and the postprocessing checksum is overwritten with the result.
The STAT[0] field indicates whether the internally calculated checksum is updated. If STAT[0] = ‘0’,
the internal checksum register is updated. If STAT[0] = ‘1’, the internal checksum register is not
updated.
Insert Next Header w/ or w/o Checksum Modification
[Instruction format, bits 31 to 00: opcode = 1010, L = 0, NH = 1, CS = 0, followed by the length, STAT, P, and offset in output stream fields.]
Insert_Result Operation Example (Insert Checksum)
This operation inserts the checksum field. The checksum is inserted at the location indicated by the
offset in output stream field.
The inserted checksum is the sum of the current checksum and the fields inserted
before this instruction.
Insert Checksum
[Instruction format, bits 31 to 00: opcode = 1010, L = 0, NH = 0, CS = 1, followed by the length, STAT, P, and offset in output stream fields.]
D.5.6.2 REPLACE_BYTE Instruction
The REPLACE_BYTE instruction overwrites one byte of data in the output data stream. The byte
used to overwrite is located in the instruction’s data byte field. The byte to be overwritten is
located by the offset in output stream field.
The REPLACE_BYTE instruction can also be used to append one byte of data in the output data
stream by setting the offset in output stream field to 0xFFFE. In this case, a full dword is
appended to the output data stream with the instruction’s data byte field located in bits [7:0] of
the dword, with dword bits [31:8] set to 0. Note also that the B bit in the result token will be set,
indicating that the data is appended.
Note that in the Crypto Packet Processor configurations, only the append option of the
REPLACE_BYTE instruction is available. Because of the output buffer’s limited size, the
REPLACE_BYTE instruction can only be used to append a single byte.
REPLACE_BYTE
[Instruction format, bits 31 to 00: opcode = 1011, data byte, R = 0, STAT, R = 0, offset in output stream. Table D-3 gives the bit positions.]
Table D-3. REPLACE_BYTE Definition
Bits | Name | Description
31:28 | opcode | Refer to "opcode" for a description of this field.
27:20 | data byte | The byte of data to be used to overwrite a byte in the output data stream or appended to the data stream.
19 | R | Reserved. Set to 0.
18:17 | STAT | Refer to Table D-2, “Instruction Format,” on page 394 for a description of this field.
16 | R | Reserved. Set to 0.
15:00 | offset in output stream | The offset to the data byte in the output data stream to be overwritten. Refer to "opcode" for a description of this field.
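As an illustration of Table D-3, the following Python sketch packs a REPLACE_BYTE instruction word. The helper name is hypothetical, and the opcode value 1011 is an assumption read from the REPLACE_BYTE figure; the field positions and the 0xFFFE append offset come from the text above.

```python
REPLACE_BYTE_OPCODE = 0b1011   # assumption: opcode bits as read from the figure
APPEND_OFFSET = 0xFFFE         # 'offset in output stream' value that selects append


def encode_replace_byte(data_byte: int, offset: int = APPEND_OFFSET,
                        stat: int = 0) -> int:
    """Pack a REPLACE_BYTE instruction word following Table D-3.

    Bits 31:28 opcode, 27:20 data byte, 19 reserved (0), 18:17 STAT,
    16 reserved (0), 15:00 offset in output stream.
    """
    if not 0 <= data_byte <= 0xFF:
        raise ValueError("data byte must fit in 8 bits")
    if not 0 <= offset <= 0xFFFF:
        raise ValueError("offset must fit in 16 bits")
    if not 0 <= stat <= 0b11:
        raise ValueError("STAT must fit in 2 bits")
    return (REPLACE_BYTE_OPCODE << 28) | (data_byte << 20) | (stat << 17) | offset


# Append the byte 0xAB to the output data stream.
print(hex(encode_replace_byte(0xAB)))
```

The default offset is the append value because, as noted above, only the append option is available in the Crypto Packet Processor configurations.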
D.5.6.3 Reserved Instructions
The opcodes for reserved instructions may not be used.
RESERVED
[Instruction format, bits 31 to 00: opcode = 1100; no other fields are defined.]
D.5.7 Result Instructions
The VERIFY_FIELDS instruction is currently the only RESULT instruction.
D.5.7.1 VERIFY_FIELDS Instruction
This instruction verifies context record fields or other retrieved values against other retrieved or
calculated values. These values can be the hash result, padding, checksum, SPI, or
sequence number. Comparison of retrieved, calculated and/or context values can only be
assessed for inbound packets.
The VERIFY_FIELDS instruction must always be the last instruction sent after execution instructions. It cannot be followed by additional execution instructions. It can only be followed by a
Context Control Instruction.
This instruction is used for generating error codes E9 through E13. See Table D-8, “Error Codes,”
on page 430 for error code descriptions. If no VERIFY_FIELDS instruction is used, these error
conditions are not detected.
VERIFY_FIELDS
[Instruction format, bits 31 to 00: opcode = 1101, S, SP, CS, P, reserved, STAT, H, length. Table D-4 gives the bit positions.]
Table D-4. VERIFY_FIELDS Definition
Bits | Name | Description
31:28 | opcode | Refer to "opcode" for a description of this field.
27 | S | Sequence Number. Set this bit to ‘1’ to verify the sequence number. Refer to “Sequence Number Check” on page 450.
26 | SP | SPI. Set this bit to ‘1’ to verify the retrieved SPI.
25 | CS | Checksum. Set this bit to ‘1’ to verify the checksum.
24 | P | Padding. Set this bit to ‘1’ to verify padding. (Set to ‘1’ only if the padding type allows verification.)
22:19 | reserved | Reserved. These bits should be set to ‘0’.
18:17 | STAT | Status. This field must always be set to ‘11’.
16 | H | Hash. Set this bit to ‘1’ to verify the hash result.
15:00 | length | Length. Indicates the number of hash result bytes that need to be compared. Valid values are:
  0001100 = 12 bytes
  0010000 = 16 bytes
  0010100 = 20 bytes (SHA-1 and SHA-2 only)
  0011000 = 24 bytes (SHA-2 only)
  0011100 = 28 bytes (SHA-2 only)
  0100000 = 32 bytes (SHA-2 only)
  0110000 = 48 bytes (SHA-2 only)
  1000000 = 64 bytes (SHA-2 only)
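A minimal Python sketch of how a VERIFY_FIELDS word could be assembled from Table D-4 follows. The function and parameter names are hypothetical, the opcode value 1101 is an assumption read from the VERIFY_FIELDS figure, and leaving the length field at 0 when the H bit is clear is also an assumption.

```python
VERIFY_FIELDS_OPCODE = 0b1101  # assumption: opcode bits as read from the figure
VALID_HASH_LENGTHS = {12, 16, 20, 24, 28, 32, 48, 64}  # bytes, from Table D-4


def encode_verify_fields(seq: bool = False, spi: bool = False,
                         checksum: bool = False, padding: bool = False,
                         hash_bytes: int = 0) -> int:
    """Pack a VERIFY_FIELDS instruction word following Table D-4.

    hash_bytes = 0 means "do not verify the hash" (H bit left at 0).
    """
    word = VERIFY_FIELDS_OPCODE << 28
    word |= (int(seq) << 27) | (int(spi) << 26)
    word |= (int(checksum) << 25) | (int(padding) << 24)
    word |= 0b11 << 17                      # STAT must always be '11'
    if hash_bytes:
        if hash_bytes not in VALID_HASH_LENGTHS:
            raise ValueError("unsupported hash result length")
        word |= 1 << 16                     # H: verify the hash result
        word |= hash_bytes                  # length, bits 15:00
    return word


# Verify sequence number, SPI, and a 12-byte hash result.
print(hex(encode_verify_fields(seq=True, spi=True, hash_bytes=12)))
```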
D.5.8 Context Control Instructions
The CONTEXT_ACCESS instruction is currently the only Context Control instruction supported.
D.5.8.1 CONTEXT_ACCESS Instruction
The CONTEXT_ACCESS instruction is used to update fields in the external context record. The
offset field is an external relative offset pointing to the base address of the context record. The
origin field is a second offset pointing to the internal field(s) that need to be updated.
In IPSec, this instruction is typically used to update the sequence number of the context record,
and if it is an inbound packet, the sequence number mask fields. Optionally, the IV fields can
also be updated using a separate CONTEXT_ACCESS instruction.
The CONTEXT_ACCESS instruction is only executed after all Crypto Packet Processor processing
completes (preprocessing, packet engine processing and postprocessing), unless both the Fail and
Pass (F and P) fields are set to 0, in which case the CONTEXT_ACCESS instruction can be executed
at any time.
Using the F and P fields (together referred to as the result type field) the CONTEXT_ACCESS
instruction can be selectively executed if a packet passes or fails. If the result type field is set
to 01, the instruction will be executed if the packet passed. If the result type field is set to 10,
the instruction will be executed if the packet failed.
CONTEXT_ACCESS
[Instruction format, bits 31 to 00: opcode = 1110, length, origin, STAT, res, U, F and P (result type), D, reserved, offset; optionally followed by data (max = 1 dword). Table D-5 gives the bit positions.]
Table D-5. CONTEXT_ACCESS
Bits | Name | Description
31:28 | opcode | Refer to "opcode" for a description of this field.
27:24 | length | The length field (see “length” on page 395) indicates the number of dwords that need to be transferred.
23:19 | origin | The ‘origin’ field is an offset pointing to the internal context record field(s) that need to be updated. All internal registers can be read for insertion into the external context record. Note that only general-purpose, IV, sequence number (and mask) internal registers are writable using this instruction. Typically, only read actions are required to update the external context record. Note that in Table D-6, some of the internal registers are ordered to allow multiple fields to be inserted with one instruction. For example, the registers highlighted in yellow (sequence number result and sequence number mask, or IV0 through IV3) are typically written to the external context record with one instruction.
18:17 | STAT | = 11 – last instruction (optionally set)
16:15 | res | Reserved, must be set to 00.
14 | U | Use token data when processing this instruction. There can only be one dword of data immediately following the token instruction. In this case, the ‘origin’ field must be set to 11011 and the length field (see “length” on page 395) must be 0001.
13, 12 | F, P | Fail and Pass bit combinations. Together, these bits are also referred to as the ‘result type’ field.
  00 = Execute the CONTEXT_ACCESS instruction immediately. When both bits are set to ‘00’, this instruction can be executed anytime, even before packet processing.
  01 = Execute the CONTEXT_ACCESS instruction only if all Crypto Packet Processor processing has completed successfully (without errors).
  10 = Execute the CONTEXT_ACCESS instruction only if all Crypto Packet Processor processing has completed unsuccessfully. In this case the VERIFY_FIELDS instruction resulted in one or more of the following errors: error codes E9 through E13; see Table D-8 for error code descriptions. Executing the CONTEXT_ACCESS instruction in these cases could be useful for debugging or keeping statistical data.
  11 = Execute the CONTEXT_ACCESS instruction only if all Crypto Packet Processor processing has completed, successfully or unsuccessfully.
11 | D | Direction. In the Crypto Packet Processor this bit must be set to 1. This indicates that the CONTEXT_ACCESS instruction results in the Crypto Packet Processor updating the external context record, also known as a ‘context write’ operation to the external context record. Note: In standard (protocol) scenarios the ‘context read’ operation (when the D bit is set to 0) is not used. In the Crypto Packet Processor configurations only ‘context write’ operations are allowed. ‘Context read’ operations should not be started; the only reads are sequential fetches from the context input FIFO that are initiated by the Crypto Packet Processor itself.
10:8 | reserved | Reserved bits must be set to 0.
7:00 | offset | The (32-bit word) offset in the context record.
Table D-6. ‘Origin’ Field Encoding for CONTEXT_ACCESS Instruction
‘origin’ field value | Context Record Fields
00000 | reserved
00001 | reserved
00010 | reserved
00011 | reserved
00101 | sequence number
00110 | sequence number mask (length can be 2 or 4)
00111 | reserved (sequence number mask 2nd word)
01000 | reserved (sequence number mask 3rd word)
01001 | reserved (sequence number mask 4th word)
01010 | sequence number
01011 | extended sequence number
01100 | sequence number mask (length can be 2 or 4)
01101 | reserved (sequence number mask 2nd word)
01110 | reserved (sequence number mask 3rd word)
01111 | reserved (sequence number mask 4th word)
10000 | general purpose register 0
10001 | general purpose register 1
10010 … 10011 | reserved
10100 | IV0
10101 | IV1
10110 | IV2
10111 | IV3
11000 | hash result count
11001 | ARC4 IJ-pointer
11010 | ARC4 state record. The length is “don’t care” and is forced to 64 (32-bit words). The offset is “don’t care” and is ignored; the external DMA address is the ARC4-state pointer from the context record.
11011 | from token (see U, bit 15)
11100 | hash result digest (length can be 4, 5, 8 or 16; see Note 1)
11101 | reserved
11110 | reserved
11111 | reserved
Note 1: If the selected origin field is hash result digest (11100), a length of 16 exceeds the width of the length field. To achieve an update with length 16 (for SHA512), a length of ‘0’ must be selected in the instruction. This results in a transfer length of 16 (32-bit) words.
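The field layout in Table D-5 and the length-16 rule in the note to Table D-6 can be sketched in Python as follows. The helper and parameter names are hypothetical, and the opcode value 1110 is an assumption read from the CONTEXT_ACCESS figure; the bit positions and ‘origin’ encodings come from the tables.

```python
CONTEXT_ACCESS_OPCODE = 0b1110   # assumption: opcode bits as read from the figure
ORIGIN_IV0 = 0b10100             # Table D-6
ORIGIN_FROM_TOKEN = 0b11011      # Table D-6
ORIGIN_HASH_DIGEST = 0b11100     # Table D-6


def encode_context_access(length: int, origin: int, offset: int,
                          fail: bool = False, passed: bool = False,
                          use_token_data: bool = False,
                          last: bool = False) -> int:
    """Pack a CONTEXT_ACCESS instruction word following Table D-5."""
    if not 0 <= origin <= 0b11111 or not 0 <= offset <= 0xFF:
        raise ValueError("origin or offset out of range")
    if use_token_data and (origin != ORIGIN_FROM_TOKEN or length != 1):
        raise ValueError("U requires origin 11011 and length 0001")
    if length == 16 and origin == ORIGIN_HASH_DIGEST:
        length_bits = 0          # a 16-dword digest transfer is encoded as 0
    elif 1 <= length <= 15:
        length_bits = length
    else:
        raise ValueError("length must be 1..15 (or 16 for the hash digest)")
    word = (CONTEXT_ACCESS_OPCODE << 28) | (length_bits << 24) | (origin << 19)
    word |= (0b11 if last else 0) << 17          # STAT
    word |= int(use_token_data) << 14            # U
    word |= (int(fail) << 13) | (int(passed) << 12)   # result type (F, P)
    word |= 1 << 11                              # D must be 1 (context write)
    return word | offset


# Write IV0 through IV3 (4 dwords) to context word offset 8 if the packet passed.
print(hex(encode_context_access(4, ORIGIN_IV0, 8, passed=True)))
```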
D.5.9 Bypass Token Data – Special Instruction
It is possible to append token data to the result token without observation by the packet engine. This
special instruction can be used to bypass data from the input token through to the result token. If
used, it must always be the last instruction.
This data stream must start with four bits set to 1 (opcode). As a result, only 28 bits are available
in the first dword. The maximum length is four dwords, including the opcode bits.
BYPASS_TOKEN_DATA
[Instruction format: bits 31:28 opcode = 1111, bits 27:00 data (28 bits only); optionally followed by data, a maximum of 3 dwords beyond the initial 28 bits.]
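As a rough illustration of the limits described above, the following Python sketch builds the dwords of a bypass token data instruction. The function name is hypothetical; the all-ones opcode, the 28-bit first data field, and the four-dword maximum come from the text.

```python
BYPASS_OPCODE = 0b1111        # the four leading bits must all be 1
MAX_EXTRA_DWORDS = 3          # four dwords in total, including the first


def encode_bypass_token_data(first_data: int, extra_dwords=()) -> list:
    """Return the dwords of a bypass token data instruction.

    first_data is the 28-bit value carried in the first dword; extra_dwords
    are up to three further full 32-bit dwords.
    """
    if not 0 <= first_data <= 0x0FFFFFFF:
        raise ValueError("only 28 bits are available in the first dword")
    extra = list(extra_dwords)
    if len(extra) > MAX_EXTRA_DWORDS:
        raise ValueError("at most 3 dwords may follow the first dword")
    if any(not 0 <= d <= 0xFFFFFFFF for d in extra):
        raise ValueError("each extra value must fit in 32 bits")
    return [(BYPASS_OPCODE << 28) | first_data] + extra


print([hex(d) for d in encode_bypass_token_data(0x123, [0xDEADBEEF])])
```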
D.6 Result Token Definition
A result token is generated by the Crypto Packet Processor for every input token processed. The
result token contains the following fields resulting from packet processing.
[Result token layout, bits 31 to 00 per dword. Fields shown: error bits E0 through E14 and result packet length; bypass data, E15, and the packet info fields (H, hash bytes, B, C, N, L); output packet pointer; pad length and next header field (IPSec padding only); bypass token data (optional: maximum of 4 words). Remaining bits are reserved. Table D-7 gives the bit positions.]
Two errors in the result token behave differently in the Crypto Packet Processor
configurations:
• In the standard Crypto Packet Processor, error E0 occurs when either the pre-processor detects
a wrong length (instruction lengths versus packet length) or the input DMA fetch reports
an error. The latter cannot happen in the Crypto Packet Processor; however, a similar case is
detected and reported as the same error. If data_in_done is asserted before the number
of written words matches the packet length field from the input token, the Crypto Packet
Processor generates an E0 error.
• An E15 error cannot occur in the Crypto Packet Processor, so this bit is always 0.
Note: Error E14 should never occur in the Crypto Packet Processor; this bit indicates a situation
where the timeout counter “fires”. The timeout counter runs while no data movement within
the Crypto Packet Processor is detected and fires after this situation has persisted for a
fixed number of clock cycles (approximately 1000 clock cycles). The actual number of clock
cycles depends on the Crypto Packet Processor core configuration.
Table D-7. Result Token Definition
Bits | Name | Description
31 | E14 | Error Codes (bits 31 through 17). Table D-8 describes the different error codes returned by the Crypto Packet Processor. It is possible to have multiple error codes in one packet. Note that if one or more fatal errors (errors highlighted in yellow) occur, the packet will not be processed correctly and should be dropped. A fatal error only occurs if the input token or context record is incorrect.
30 | E13 |
29 | E12 |
28 | E11 |
27 | E10 |
26 | E9 |
25 | E8 |
24 | E7 |
23 | E6 |
22 | E5 |
21 | E4 |
20 | E3 |
19 | E2 |
18 | E1 |
17 | E0 |
16:00 | Result Packet Length | The result packet length equals the length of the packet that is written out of the Crypto Packet Processor, not including the appended result fields (see the ‘packet info fields’ in the result token).
4 | E15 | Error Code. See the Error Codes description for bits 31 through 17.
Table D-8. Error Codes
Error Code | Description
E0 | Packet length error: token instructions versus input or input DMA fetch.
E1 | Token error, unknown token command/instruction.
E2 | Token contains too much bypass data.
E3 | Cryptographic block size error (ECB, CBC).
E4 | Hash block size error (basic hash only).
E5 | Invalid command/algorithm/mode/combination.
E6 | Prohibited algorithm.
E7 | Hash input overflow (basic hash only).
E8 | TTL / HOP-limit underflow.
E9 | Authentication failed.
E10 | Sequence number check failed / Sequence number roll-over detected.
E11 | SPI check failed.
E12 | Checksum incorrect.
E13 | Pad verification failed.
E14 | Time-out - FATAL ERROR, (see note below).
E15 | Output DMA error.
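The error bits and result packet length in Table D-7 can be decoded with a few shifts and masks. The Python sketch below is illustrative only: the function name is hypothetical, and it handles only the result token dword that carries E0 through E14 and the result packet length (E15 sits in a different dword and is not decoded here).

```python
ERROR_DESCRIPTIONS = {            # subset of Table D-8, for readable output
    9: "Authentication failed",
    10: "Sequence number check failed / roll-over detected",
    11: "SPI check failed",
    12: "Checksum incorrect",
    13: "Pad verification failed",
    14: "Time-out",
}


def decode_result_dword(dword: int):
    """Split the result token dword holding E0..E14 and the packet length.

    Per Table D-7, E0 is bit 17 and E14 is bit 31; bits 16:00 hold the
    result packet length. Returns (list of error numbers, length).
    """
    errors = [n for n in range(15) if dword & (1 << (17 + n))]
    result_packet_length = dword & 0x1FFFF
    return errors, result_packet_length


errors, length = decode_result_dword((1 << 26) | (1 << 29) | 100)
for n in errors:
    print(f"E{n}: {ERROR_DESCRIPTIONS.get(n, 'see Table D-8')}")
print("result packet length:", length)
```

Host software would typically drop the packet, or at least skip further post-processing, whenever the returned error list is non-empty.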
Packet Info Fields – (H, hash byte, B, C, N, and L)
The following fields are collectively called the Packet Info Fields: H, hash byte, B, C, N, and L.
If a packet exceeds 1792 bytes (the output packet buffer size: 2048 minus 256), the result data must
be appended to the output packet in the output buffer before the packet is actually written out.
For example, updating the IPv4 header fields after stripping the padding (length field, protocol, and checksum).
For each header field, one dword is appended to the packet. Possible header fields are:
• Result packet length (IPv4 inbound) or payload length (IPv6 inbound).
• Next header field, retrieved from de-padding (IPv4 or IPv6 inbound).
• A re-calculated checksum after replacing the length and next header (IPv4 inbound).
• The hash result, the number of bytes indicated by the hash bytes field (AH outbound).
• Generic byte(s).
The “packet info fields” comprise the following fields:
• H: Hash byte(s) appended (as a result of an insert_result instruction); the number of
appended bytes is indicated by the hash bytes field.
• hash bytes: The number of appended hash bytes. Refer to the insert_result
instruction in “IRR Instruction Example (Insert Hash Result Operation)” on page 414.
• B: generic byte(s) appended (as a result of a REPLACE_BYTE instruction).
• C: checksum appended (as a result of an insert_result instruction).
• N: next header field appended (as a result of an insert_result instruction).
• L: length field appended (as a result of an insert_result instruction).
bypass data
This field indicates the length of the result token bypass data in dwords.
output packet pointer
This is a direct copy of the ‘output packet pointer’ from the input token. Refer to “Output Packet
Pointer (token dword [2], Required)” on page 391 for details.
next header
The next header field contains the next header result value intended for updating the IP
header, specifically, the IPv4 ‘protocol’ field or IPv6 next header field. This value is retrieved
during de-padding. The next header field is only applicable to IPSec padding; otherwise, this
field is set to 0 (8’h00).
pad length
The pad length field contains the number of detected (and removed) padding bytes. Applicable padding types are PKCS#7, RTP, IPSec, TLS and SSL. Otherwise, this field is set to 0 (8’h00).
D.7 Pre and Post-Processing by Host Software
D.7.1 Preprocessing
Incoming packets must be preprocessed in order to generate a proper token for the Crypto Packet
Processor. During preprocessing the following operations should be considered and executed
when applicable.
• FCS: Verify the 32-bit CRC of the Ethernet frame.
• Ethernet hdr: Ensure that the destination MAC address in the Ethernet header matches the
device’s MAC address.
• Ethernet hdr: Ensure that the protocol type specified in the Ethernet header is either IPv4
(0x0800) or IPv6 (0x86dd).
• Ethernet hdr: Store the size of the Ethernet header in order to insert the exact Ethernet length
when removing the Ethernet header later.
• IPv4 hdr: If the Ethernet Type field indicates IPv4, verify that the version field in the IPv4 hdr
has the value 4. If it is not 4, then drop the packet and do not send it to the Crypto Packet Processor,
since the packet is obviously corrupt.
• IPv4 hdr: Examine the IHL field to determine if IPv4 options are present.
• IPv4 hdr (optional): For ECN (Explicit Congestion Notification) the ‘Type of Service’ field could
be copied from the inner to the outer header (and vice versa) for tunnel modes.
• IPv4 hdr: The packet length field should be stored in order to pass proper payload lengths to
several Crypto Packet Processor token commands.
• IPv4 hdr: The protocol field should be stored to insert into IPv4chksum commands.
• IPv4 hdr: The checksum should be validated before passing the packet to the Crypto Packet
Processor. The IPv4chksum command can update the checksum, but does not verify its correctness.
• IPv4 hdr (AH transport only): Determine which options are mutable according to AH.
• IPv6 hdr: If the Ethernet Type field indicates IPv6, verify that the version field in the IPv6 hdr
has the value 6. If it is not 6, then drop the packet and do not send it to the Crypto Packet Processor, since
the packet is obviously corrupt.
• IPv6 hdr (optional): For Explicit Congestion Notification (ECN) the Traffic Class field could be
copied from the inner to the outer header (and vice versa) for tunnel modes.
• IPv6 hdr: The payload length field should be stored in order to pass proper payload lengths to
several Crypto Packet Processor token commands.
• IPv6 hdr: The next header field should be stored to insert into IPv6 commands and/or insert commands.
For an extensive set of example tokens, refer to document: Appendix E: “Inline Packet Engine —
Token Examples” on page 491.
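Some of the preprocessing checks listed above (protocol type, IP version, IHL, and IPv4 header checksum) can be sketched in host software as follows. This is a minimal Python illustration, not Tilera-supplied code: the function names are hypothetical, and it assumes an untagged Ethernet II frame with a 14-byte header and the FCS already stripped and verified.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_IPV6 = 0x86DD
ETH_HEADER_BYTES = 14            # assumption: no VLAN tag present


def ipv4_header_checksum_ok(header: bytes) -> bool:
    """One's-complement sum over the header (checksum included) must be 0xFFFF."""
    if len(header) % 2:
        return False
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF


def precheck_frame(frame: bytes) -> bool:
    """Return True if the frame may be passed on; False means drop it."""
    if len(frame) < ETH_HEADER_BYTES:
        return False
    ethertype = struct.unpack("!H", frame[12:14])[0]
    ip = frame[ETH_HEADER_BYTES:]
    if ethertype == ETHERTYPE_IPV4:
        if len(ip) < 20 or ip[0] >> 4 != 4:      # version field must be 4
            return False
        ihl_bytes = (ip[0] & 0x0F) * 4           # IHL > 5 means options present
        if ihl_bytes < 20 or len(ip) < ihl_bytes:
            return False
        return ipv4_header_checksum_ok(ip[:ihl_bytes])
    if ethertype == ETHERTYPE_IPV6:
        return len(ip) >= 40 and ip[0] >> 4 == 6  # version field must be 6
    return False                                  # neither IPv4 nor IPv6
```

The remaining items in the list, such as the destination MAC comparison and storing the length and protocol fields for later token commands, depend on device state and token construction and are left out of this sketch.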
D.7.2 Post-Processing
D.7.2.1 Result Token
After the Crypto Packet Processor finishes processing a token, a result token is generated. If no
errors occurred during processing, output data will also be available.
D.7.2.2 Appended Data
When the Crypto Packet Processor has finished processing a token, data is sometimes appended
to the packet. When there is appended data, the result token provides information on what is
appended; however, it does not indicate in what order the data is appended.
Three instructions cause different append bits to be set in the result token. These are:
• The RESULT instruction with L, NH, CS bits not set -> hash bytes defined
• The RESULT instruction with L, NH and/or CS bits set -> L, NH and/or CS bit set
• The REPLACE instruction with undefined data to append -> B bit set
Data is appended to the output data in the order in which the instructions were provided in the
token. For the RESULT instruction with L, NH and/or CS bits set, the appended data order is
always Length-NH-checksum.
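The fixed Length-NH-checksum order makes it possible to label the appended dwords from the result token bits alone in the simplest case. The Python sketch below assumes a token containing a single insert_result instruction with L, NH and/or CS set, one appended dword per field, and no appended hash bytes or generic bytes; the function name is hypothetical, and the byte layout inside each dword is not interpreted.

```python
def label_appended_dwords(appended, l_bit: bool, n_bit: bool, c_bit: bool) -> dict:
    """Map appended dwords to field names using the Length-NH-checksum order.

    'appended' is the list of dwords found after the packet data; the three
    flags are the L, N, and C bits from the result token.
    """
    names = [name for name, flag in (("length", l_bit),
                                     ("next_header", n_bit),
                                     ("checksum", c_bit)) if flag]
    if len(appended) != len(names):
        raise ValueError("appended dword count does not match the result token bits")
    return dict(zip(names, appended))


# L and N set, C clear: two appended dwords, length first.
print(label_appended_dwords([0x00000028, 0x00000006], True, True, False))
```

With several insert_result or REPLACE_BYTE instructions in one token, the host has to track the instruction order itself, since the result token does not record it.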
D.7.2.3 Suggested Post-Processing Operations
During post-processing the following operations should be considered and executed when
applicable.
• Result token: verify that no error bits are set.
• Result token: test if the length (L bit) has been appended. If it has been appended, replace the
length in the IPv4 / IPv6 header with the appended length. In the case of IPv6, bear in mind
that the length in the IPv6 header is the payload length and not the packet length.
• Result token: test if the next header (N bit) has been appended. If so, replace the protocol field
in the IPv4 header or the next header field in the IPv6 header with the appended value.
• Result token: test if the checksum (C bit) has been set, indicating that a 16-bit checksum has
been appended. The appended checksum should replace the checksum in the IPv4 header.
•
Result token: next header field: fo