Tile Processors:
Many-Core for Embedded
and Cloud Computing
Richard Schooler
VP Software Engineering
Tilera Corporation
rschooler@tilera.com
Exploiting Natural Parallelism

High-performance applications have lots of
parallelism!
– Embedded apps:
• Networking: packets, flows
• Media: streams, images, functional & data parallelism
– Cloud apps:
• Many clients: network sessions
• Data mining: distributed data & computation

Lots of different levels:
– SIMD (fine-grain data parallelism)
– Thread/process (medium-grain task parallelism)
– Distributed system (coarse-grain job parallelism)
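To make the finer-grain levels concrete, here is a minimal sketch (not from the slides; sizes, names, and the pthreads framing are illustrative): the same workload exposes fine-grain data parallelism in its inner loop, which a compiler can map to SIMD instructions, and medium-grain task parallelism as one POSIX thread per image stripe.

/* Illustrative sketch only: SIMD-friendly inner loop + one thread per stripe. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define WIDTH  1024
#define HEIGHT 1024
#define NTHREADS 4

static uint8_t frame[HEIGHT][WIDTH];

struct stripe { int row_begin, row_end; uint64_t sum; };

static void *process_stripe(void *arg)
{
    struct stripe *s = arg;
    uint64_t sum = 0;
    for (int y = s->row_begin; y < s->row_end; y++)
        for (int x = 0; x < WIDTH; x++)
            sum += frame[y][x];          /* fine-grain, data-parallel inner loop */
    s->sum = sum;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct stripe work[NTHREADS];
    int rows = HEIGHT / NTHREADS;

    /* Medium-grain: one thread per stripe of the image. */
    for (int i = 0; i < NTHREADS; i++) {
        work[i].row_begin = i * rows;
        work[i].row_end   = (i + 1) * rows;
        pthread_create(&tid[i], NULL, process_stripe, &work[i]);
    }
    uint64_t total = 0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += work[i].sum;
    }
    printf("total = %llu\n", (unsigned long long)total);
    return 0;
}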
Everyone is going many-core, but can the architecture scale?
[Chart: number of cores vs. time, 2005–2020. Core counts climb (Sun, IBM Cell, Intel Larrabee) while a performance and performance/W gap widens. The computing world is ready for radical change.]
Key “Many-Core” Challenges: The 3 P’s

Performance challenge
– How to scale from 1 to 1,000 cores: the number of cores is the new megahertz

Power efficiency challenge
– Performance per watt is the new metric: systems are often constrained by power & cooling

Programming challenge
– How to provide a converged many-core solution in a standard programming environment
“Problems cannot be solved by the same
level of thinking that created them.”

Current technologies fail to deliver:
– Incremental performance increases
– High power
– Low level of integration
– Increasingly bigger cores

We need new thinking to get:
– 10x performance
– 10x performance per watt
– Converged computing
– Standard programming models
Stepping Back: How Did We Get Here?

Moore’s conundrum: do more devices mean more performance?

Old answers: more complex cores, bigger caches
– But power-hungry

New answer: more cores
– But do conventional approaches scale?

Diminishing returns!
The Old Challenge: CPU-on-a-chip
20 MIPS CPU in 1987: a few thousand gates
The Opportunity: Billions of Transistors
[Diagram: yesterday's entire CPU occupies a tiny corner of a billion-transistor die.]
What to do with all those transistors?
Take Inspiration from ASICs
[Diagram: an ASIC floorplan with many small memories, ALUs, and registers connected by short, custom-routed wires.]
ASICs have high performance and low power
• Custom-routed, short wires
• Lots of ALUs, registers, memories – huge on-chip parallelism
But how to build a programmable chip?
Replace Long Wires with Routed Interconnect
[Diagram: control logic and a routed interconnect replace long global wires. IEEE Computer ’97]
From a Centralized Clump of CPUs …
[Diagram: many ALUs sharing one centralized register file and bypass network.]

… To Distributed ALUs and a Routed Bypass Network
[Diagram: ALUs distributed across the chip, each with a router, connected by a routed bypass network.]
Scalar Operand Network (SON) [TPDS 2005]
From a Large Centralized Cache …
… To a Distributed Shared Cache
[Diagram: the centralized cache is split into per-tile cache slices, each co-located with ALUs and a router. ISCA 1999]
Distributed Everything + Routed Interconnect → Tiled Multicore
[Diagram: a grid of identical tiles, each with ALUs, a cache slice, and a router.]
Each tile is a processor, so programmable
Tiled Multicore Captures ASIC Benefits and is Programmable
• Scales to large numbers of cores
• Modular: design and verify one tile
• Power efficient:
  – Short wires plus locality optimizations reduce CV²f
  – Chandrakasan effect: more cores at lower frequency and voltage reduce CV²f further
[Diagram: current bus architectures vs. the tiled approach, where core + switch = tile.]
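For reference, the dynamic-power reasoning behind these bullets written out (a standard CMOS argument, not taken verbatim from the slide; here \alpha is the activity factor, C the switched capacitance, V the supply voltage, f the clock frequency, and N, k, C' are illustrative parameters):

\[
P_{\text{dyn}} = \alpha\, C\, V^{2} f
\]

Replacing one large core at \((V, f)\) with \(N\) simpler cores at \((V/k,\ f/k)\) that together sustain comparable throughput gives

\[
P'_{\text{dyn}} \;\approx\; N\,\alpha\, C' \Big(\tfrac{V}{k}\Big)^{2} \tfrac{f}{k}
\;=\; \frac{N C'}{C k^{3}}\, P_{\text{dyn}},
\]

so for modest N and k the many-core design delivers the same work at lower power (the Chandrakasan effect), and shorter wires further reduce the switched capacitance C.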
Tilera processor portfolio
Demonstrating the scale of many-core
[Timeline, 2007–2010: TILE64, then TILEPro64 and TILEPro36, then TILE-Gx16 & Gx36 (2x the performance), then TILE-Gx64 & Gx100 (up to 8x the performance).]
TILE-Gx100™: Complete System-on-a-Chip with 100 64-bit cores
• 1.2 GHz – 1.5 GHz
• 32 MBytes total cache
• 546 Gbps peak memory BW
• 200 Tbps iMesh BW
• 80–120 Gbps packet I/O
  – 8 ports 10GbE (XAUI)
  – 2 Interlaken (40Gb)
  – 32 ports 1GbE (SGMII)
• 80 Gbps PCIe I/O
  – 3 StreamIO ports (20Gb)
• Wire-speed packet engine (mPIPE): 120 Mpps
• MiCA engines:
  – 40 Gbps crypto
  – Compress & decompress
[Block diagram: mPIPE packet I/O (10 GbE XAUI, 4x GbE SGMII, Interlaken), PCIe 2.0 (4-lane and 8-lane), MiCA engines, four DDR3 memory controllers, flexible I/O (UART x2, USB x2, JTAG, I2C, SPI), and SerDes lanes.]
TILE-Gx36™: Scaling to a broad range of applications
• 36 processor cores
• 866 MHz, 1.2 GHz, 1.5 GHz clock options
• 12 MBytes total cache
• 40 Gbps total packet I/O
  – 4 ports 10GbE (XAUI)
  – 16 ports 1GbE (SGMII)
• 48 Gbps PCIe I/O
  – 2 StreamIO ports (16 Gbps)
• Wire-speed packet engine (mPIPE): 60 Mpps
• MiCA engine:
  – 20 Gbps crypto
  – Compress & decompress
[Block diagram: mPIPE packet I/O (10 GbE XAUI, 4x GbE SGMII), PCIe 2.0 (4-lane and 8-lane), MiCA engine, two DDR3 memory controllers, flexible I/O (UART x2, USB x2, JTAG, I2C, SPI), and SerDes lanes.]
Full-Featured General Converged Cores
• Processor
  – Each core is a complete computer
  – 3-way VLIW CPU
  – SIMD instructions: 32-, 16-, and 8-bit ops
  – Instructions for video (e.g., SAD) and networking
  – Protection and interrupts
• Memory
  – L1 and L2 caches (16K L1-I, 8K L1-D, 64K L2 per tile)
  – Virtual and physical address space
  – Instruction and data TLBs
  – Cache-integrated 2D DMA engine
• Runs SMP Linux
• Runs off-the-shelf C/C++ programs
• Signal processing and general apps
[Diagram: each core contains a register file, three execution pipelines, L1/L2 caches with I- and D-TLBs, a 2D DMA engine, and a terabit switch.]
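To illustrate the kind of kernel these SIMD and SAD instructions target, here is a minimal sketch in plain C (illustrative only; the slide shows no code, and a vectorizing compiler or intrinsics would map the inner loop to wide SIMD/SAD operations):

/* Sum of absolute differences (SAD) between two 8x8 blocks, the core
 * operation of video motion estimation. Illustrative sketch only. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static unsigned sad_8x8(const uint8_t *a, const uint8_t *b, size_t stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            sad += (unsigned)abs((int)a[x] - (int)b[x]);  /* candidate for SIMD/SAD ops */
        a += stride;
        b += stride;
    }
    return sad;
}

int main(void)
{
    uint8_t cur[8 * 8], ref[8 * 8];
    for (int i = 0; i < 64; i++) { cur[i] = (uint8_t)i; ref[i] = (uint8_t)(i + 3); }
    printf("SAD = %u\n", sad_8x8(cur, ref, 8));
    return 0;
}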
Software must complement the hardware
Enable re-use of existing code-bases

Standards-based development environment
– e.g. gcc, C, C++, Java, Linux
– Comprehensive command-line & GUI-based tools

Support multiple OS models
– One OS running SMP
– Multiple virtualized OS’s with protection
– Bare metal or “zero-overhead” with background OS environment

Support a range of parallel programming styles (a sketch of one style follows this list)
– Threaded programming (pThreads, TBB)
– Run-to-completion with load balancing
– Decomposition & pipelining
– Higher-level frameworks (Erlang, OpenMP, Hadoop, etc.)
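As one concrete illustration of the run-to-completion style listed above, here is a minimal sketch (not Tilera-specific; all names and sizes are illustrative) in which worker threads pull packets from a shared queue, so load balances naturally across however many cores are available:

/* Run-to-completion sketch: each worker pulls the next work item and
 * processes it fully. Illustrative only, not a Tilera API. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NWORKERS 4
#define NPACKETS 32

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int next_packet = 0;

static void process_packet(int wid, int id)   /* stand-in for real packet work */
{
    printf("worker %d handled packet %d\n", wid, id);
}

static void *worker(void *arg)
{
    int wid = (int)(intptr_t)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int id = next_packet < NPACKETS ? next_packet++ : -1;
        pthread_mutex_unlock(&lock);
        if (id < 0)
            break;                             /* queue drained */
        process_packet(wid, id);               /* run to completion */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}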
Software Roadmap

Standards & open source integration
– Compiler: gcc, g++ 4.4+
– Linux:
• Kernel: Tile architecture integrated into 2.6.36
• User-space: glibc, broader set of standard
packages

Extended programming and runtime
environments
– Java: porting OpenJDK
– Virtualization: porting KVM
Tile architecture:
The future of many-core computing

Multicore is the way forward
– But we need the right architecture to utilize it

The Tile architecture addresses the challenges
– Scales to 100s of cores
– Delivers very low power
– Runs your existing code

Standards-based software
– Familiar tools
– Full range of standard programming environments
Thank you!
Questions?
Research Vision to Commercial Product
[Timeline: 1996, a blank slate and the opportunity of roughly 1B transistors by 2007; 1997–2002, MIT Raw research prototype with 16 cores; 2007, Tile Processor with 64 cores; 2010, TILE-Gx100 with 100 cores; and the future, toward roughly 100B transistors by 2018.]
Standard tools and programming model
Multicore Development Environment

Standards-based tools
• Standard programming: SMP Linux 2.6, ANSI C/C++, Java, PHP
• Integrated tools: GCC compiler, standard gdb and gprof, Eclipse IDE
• Innovative tools: multicore debug, multicore profile

Standard application stack
• Application layer: open-source apps, standard C/C++ libs
• Operating system layer: 64-way SMP Linux, Zero Overhead Linux, bare-metal environment
• Hypervisor layer: virtualizes hardware, I/O device drivers
[Software stack diagram: applications and libraries on top of the operating system (kernel and drivers), on top of the hypervisor (virtualization and high-speed I/O drivers), on top of the Tile Processor's array of tiles.]
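Because the stack is standard SMP Linux with ANSI C/C++ and gcc, ordinary shared-memory code parallelizes across the tiles. A minimal sketch using OpenMP, which the deck lists among the supported frameworks (this specific example is illustrative, not from the slides), compiled with gcc -fopenmp:

/* Illustrative only: a standard OpenMP loop spreads iterations across
 * however many tiles the SMP Linux kernel exposes as CPUs.
 * Nothing here is Tilera-specific. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    enum { N = 1 << 20 };
    static double a[N];
    double sum = 0.0;

    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < N; i++) {
        a[i] = (double)i * 0.5;
        sum += a[i];
    }

    printf("max threads=%d sum=%f\n", omp_get_max_threads(), sum);
    return 0;
}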
Standard Software Stack
[Diagram: management protocols (IPMI 2.0), infrastructure apps (transcoding, network monitoring), language support (Perl), and compiler/OS/hypervisor (commercial Linux distribution, gcc & g++) layered above the Tile Processor.]
High single core performance
Comparable to Atom & ARM Cortex-A9 cores
Single-core, single-thread CoreMark™ comparison
[Bar chart of CoreMark scores (y-axis 500–3,500): Tilera TILEPro64 at 866 MHz, Tilera TILE-Gx36 at 1.25 GHz, ARM Cortex-A9 at 1 GHz, Intel Atom N270 at 1600 MHz.]
- Data for TILEPro, ARM Cortex-A9, Atom N270 is available on the CoreMark website http://coremark.org/home.php
- TILE-Gx and single thread Atom results were measured in Tilera labs
- Single core, single thread result for ARM is calculated based on chip scores
Significant value across multiple markets
Networking
• Classification
• L4-7 Services
• Load Balancing
• Monitoring/QoS
• Security
Multimedia
• Video Conferencing
• Media Streaming
• Transcoding
Wireless
• Base Station
• Media Gateway
• Service Nodes
• Test Equipment
Cloud
• Apache
• Memcached
• Web Applications
• LAMP stack
High Performance
Low Power
Standard Programming
Over 100 customers
Over 40 customers going into production
Tier 1 customers in all target markets
Targeting markets with highly parallel
applications
Web: web serving, in-memory cache, data mining
Media delivery: transcoding, video delivery, wireless media
Government: lawful interception, surveillance, other

Common themes
– Hundreds and thousands of servers running each application
– Thousands of parallel transactions
– All need better performance and power efficiency
The Tile Processor Architecture
Mesh interconnect, power-optimized cores
• 2-dimensional mesh network with 200 Tbps on-chip bandwidth
• Core + switch = tile
• Scales to large numbers of cores
• Modular: design and verify one tile
• Power efficient:
  – Short wires and locality optimizations reduce CV²f
  – Chandrakasan effect: more cores at lower frequency and voltage reduce CV²f further
[Diagram: traditional bus/ring architecture vs. the 2D mesh of tiles.]
Distributed “everything”
Cache, memory management, connectivity

Big centralized caches don’t scale
– Contention
– Long latency
– High power

Distributed caches have numerous
benefits
– Lower power (less logic lit-up per access)
– Exploit locality (local L1 & L2)
– Can exploit various cache placement mechanisms to enhance performance
[Diagram: distributed caches form a completely HW-coherent cache system over a global 64-bit address space.]
Highest compute density
• 2U form factor
• 4 hot-pluggable modules
• 8 Tilera TILEPro processors
• 512 general-purpose cores
• 1.3 trillion operations/sec
Power efficient and eco-friendly server
• 10,000 cores in an 8-kilowatt rack
• 35–50 watts max per node
• Server power of 400 watts
• 90%+ efficient power supplies
• Shared fans and power supplies
Coherent distributed cache system
Distributed cache: full hardware cache coherence, standard shared-memory programming model

• Globally shared physical address space
  – Each tile has local L1 and L2 caches
  – The aggregate of the L2 caches serves as a globally shared L3
  – Any cache block can be replicated locally
  – Hardware tracks sharers and invalidates stale copies
• Dynamic Distributed Cache (DDC™)
  – Memory pages distributed across all cores or homed by the allocating core
• Coherent I/O
  – Hardware maintains coherence
  – I/O reads/writes are coherent with tile caches
  – Reads/writes delivered by hardware to the home cache
  – Headers/packets delivered directly to tile caches
[Diagram: a completely coherent system with four DDR2 controllers, PCIe, XAUI, and GbE interfaces, flexible I/O (UART, JTAG, SPI, I2C), and SerDes lanes surrounding the tile array.]
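Since the coherent distributed cache presents a standard shared-memory model, locality can still be exploited from ordinary Linux code by pinning threads to specific CPUs (tiles) so their working sets stay in the local caches. A minimal sketch using the generic Linux CPU-affinity API (illustrative; this is not Tilera's DDC-specific homing interface, which the slide does not show):

/* Illustrative sketch: pin each worker thread to one CPU (tile) with the
 * standard glibc/pthreads affinity API, so the data that thread touches
 * tends to stay in that tile's local caches. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>

#define NWORKERS 4

static void *worker(void *arg)
{
    int cpu = (int)(intptr_t)arg;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        fprintf(stderr, "could not pin to cpu %d\n", cpu);

    /* ... per-tile work on data allocated and touched by this thread ... */
    printf("worker pinned to cpu %d\n", cpu);
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}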
