Experiences with Two
FabScalar-based Chips
Elliott Forbes, Rangeen Basu Roy Chowdhury,
Brandon Dwiel, Anil Kannepalli, Vinesh Srinivasan,
Zhenqian Zhang, Randy Widialaksono, Thomas Belanger,
Steve Lipa, Eric Rotenberg, W. Rhett Davis, Paul D. Franzon
Electrical and Computer Engineering – North Carolina State University
Introduction
• FabScalar
– Synthesizable and parameterized RTL for OOO
superscalar cores
• Fetch and issue widths, structure sizes
– No memory hierarchy or “uncore” (until next release)
• Two FabScalar-based research projects
– H3 (“Heterogeneity in 3D”)
• Two cores with different microarchitectures
• Hardware support for thread migration
– AnyCore
• One core with reconfigurable microarchitecture
Goals
• Technical
– Explore adaptivity: Ability to adjust microarchitecture
to current instruction-level behavior
• Migrate program execution to more suitable core (H3)
• Reconfigure core (AnyCore)
• Non-technical
– Fulfill original vision of FabScalar
• Streamline development of single-ISA
heterogeneous multi-core processors
– Experience realities of fabricating designs
– Have fun building stuff
H3 Overview
• Two stacked asymmetric cores
• Fast Thread Migration (FTM)
– Bulk swap of arch. register state
• Cache-Core Decoupling (CCD)
– Cores may switch L1 caches
at thread migrations
Parameter     “top” core          “bottom” core
fetch width   2                   1
issue width   3 (alu, br, ld-st)  3
IQ            32                  16
LQ/SQ         16/16               16/16
ROB           64                  32
PRF           96                  64
L1 I$         4KB DM              4KB DM
L1 D$         8KB 4-way           8KB 4-way
Face-to-face 3D bonding provides dense low-latency interconnect
• Two phases
– Phase 1: 2D IC (completed testing in June 2015)
• Test cores, caches, compiled memories, migration logic, etc.
– Phase 2: 3D IC (August 2015 tapeout)
• Demonstrate stacked out-of-order cores, benefits of
heterogeneity, etc.
H3 Design
Timeline - Design and Verification (Jan 2012 to Jun 2013)
• 9 months until RTL freeze
• Most effort on caches, buses, new features.
• Brandon: cores
– generate dual hetero. cores with FabScalar
– dust off in-order scalar core (cut from tapeout)
– in-order core checked-in to SVN repo
– FabScalar-generated cores checked-in to SVN repo
– L1 instruction cache
– modify Fetch-1 stage for synch. R/W RAMs for I$, BP, BTB
– I/O (4 buses: core 1 req+resp, core 2 req+resp), serializer/deserializer
– cache-core decoupling (CCD) for both I- and D-caches
– synthesis and scan chain insertion
– RTL freeze on "allcores" (Sep. 19, 2012)
– delivered synthesized netlist with scan chains inserted
– regenerate memories for L1 I$, L1 D$, BP, BTB (in IBM 8RF)
– post-RTL-freeze verification (RTL simulation)
– post-tapeout verification (P&R netlist simulation)
– hetero. multi-core research (C++ simulator, power models, etc.)
• Rangeen: L1 data cache
– study OpenSparc T2 L1 data cache
– integrate T2 D$ into in-order core for early design and testing
– integrated in-order core + D$ checked-in to SVN repo
– misc. mods to T2 D$ (remove TLB, reset strategy, etc.)
– redesign per-thread MHSRs for single thread miss-under-miss (MLP)
– modify core's load/store lane for 2-cycle MEM stage
– integrated OOO cores + D$ checked-in to SVN repo
– debug
• Elliott: core-side FTM, perf. counters, debug core
– global migration
– perf. counters
– local migration
– integrate with face-to-face bus controller
– debug core: start debug core from 2-wide core
– debug core: replace L1 I/D caches with I/D scratchpads
– debug core: replace memory bus with scan I/O
• Zhenqian: face-to-face bus for FTM
• Randy: flow, physical design
– missed tapeout: Tezzaron 3D run deferred indefinitely
– download & unpack IBM 8RF PDK, ARM IP (std cells, pads, mem)
– physical design (floorplan, P&R, DRC, LVS, power integrity)
– missed tapeout: couldn't close on DRC, LVS
– tapeout (May 20, 2013)
FabScalar saved effort.
• Two cores generated with FabScalar
• I-cache: in-house
• Modify Fetch-1 for synch. R/W compiled memories
• Chip I/O (mem. buses, serializer/deserializer)
• research (culminated in data for ICCD-31 paper)
• D-cache: retool and integrate OpenSparc-T2 L1 D$.
– Original plan (canceled): leverage T2’s 8 core x 8 bank crossbar and L2 $ implemented in stacked DRAM.
• New features: CCD, perf. counters, FTM
H3 Design
Physical design.
• 6 months of physical design
• research (culminated in data for ICCD-31 paper)
Physical design data (phase 1)
Technology      IBM 8RF (130 nm)
Dimensions      5.25 mm x 5.25 mm
Area            27.6 mm2
Transistors     14.6 million
Cells           1.1 million
Nets            721 thousand
Memory macros   56
Clock domains   10
Pads            400 (100 for each of four experiments)
Tool execution time
Encounter PAR   9 hours
Calibre DRC     4.5 hours
Calibre LVS     2 hours
H3 Design
Mitigating risk.
Dedicated memory buses for the two cores
• Avoid a potential single point of failure
Parameterized memory bus width
• Reduce schedule risk: Early pad planning is important but fluid.
Full scan in 1-wide core
• Observability/controllability of at least one core with caches
• 2-wide core doesn’t have scan overhead
Debug Core
Rationale:
• Test a “pure” FabScalar core
• Plan B, in case two-core-stack doesn’t work
• Eliminate risky aspects
• Enhance testability/debuggability
Core configuration:
• Same configuration as 2-wide core
Key features:
• I-cache and D-cache replaced with synthesized I and D scratchpads
• No compiled memories
• No complex caches
• Full scan
• Observability/controllability for debug
• No memory buses: Scratchpads preloaded/examined via scan chains
Figure: Die photo + floorplan
H3 Verification
• RTL verified using SPEC2K SimPoints
– In retrospect, should have also used microbenchmarks
• Lesson: Budget enough time for netlist verification
– Major effort to set up netlist simulation
• Testbench and debug more complicated (everything blasted into individual nets)
• SDF annotation requires experience
• Most issues caused by testbench and SDF problems
– Found serious, but not fatal, bug just after tapeout
• A difference between RTL and netlist caused by misplaced `ifdef
• `ifdef SIM guards instrumentation in the RTL. Thus, SIM is defined in testbench
but not in synthesis script. The problem is that a small real code fragment was
also mistakenly guarded by it.
• Lessons: (1) Consolidate all instrumentation in testbench (none in RTL).
(2) Do netlist verification, because netlist may not equal RTL.
– Netlist simulation also would have alerted us to hold-time violations in D$
• OpenSparc T2 D$ is a heavily latch-based industry design
• Problem encountered in chip bring-up, diagnosed with netlist simulation
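The `ifdef lesson above lends itself to an automated review aid. The following is a minimal sketch, not the project's actual flow: it lists every source line that is compiled only when a given macro (here `SIM`) is defined, so a reviewer can confirm that nothing but instrumentation sits inside the guard. The function name, the sample RTL, and the signal names `wakeup_valid` and `load_hit` are hypothetical, and `elsif branches are handled only approximately.

```python
import re

# Matches Verilog conditional-compilation directives at the start of a line.
DIRECTIVE = re.compile(r"^\s*`(ifdef|ifndef|else|elsif|endif)\b\s*(\w*)")

def lines_guarded_by(source, macro="SIM"):
    """Return (line_number, text) for each line compiled only when `macro` is defined."""
    guarded = []
    # One entry per open conditional: [tests_our_macro, active_only_when_macro_defined]
    stack = []
    for number, text in enumerate(source.splitlines(), start=1):
        match = DIRECTIVE.match(text)
        if match:
            kind, name = match.groups()
            if kind == "ifdef":
                stack.append([name == macro, name == macro])
            elif kind == "ifndef":
                stack.append([name == macro, False])
            elif kind == "else" and stack:
                # The `else branch of `ifndef SIM is the SIM-only branch, and vice versa.
                stack[-1][1] = stack[-1][0] and not stack[-1][1]
            elif kind == "elsif" and stack:
                stack[-1] = [False, False]  # approximation: stop tracking this conditional
            elif kind == "endif" and stack:
                stack.pop()
            continue
        if any(active for _, active in stack) and text.strip():
            guarded.append((number, text.strip()))
    return guarded

# Hypothetical RTL fragment: one instrumentation line and one line of real logic
# sharing the same guard, which is the kind of mistake described above.
SAMPLE = """\
always @(posedge clk) begin
`ifdef SIM
  $display("load miss at %t", $time);
  wakeup_valid <= load_hit;
`endif
end
"""

for number, text in lines_guarded_by(SAMPLE):
    print(f"line {number}: {text}")
```

Any reported line that is not a `$display`, assertion, or other testbench-only construct deserves a second look before synthesis.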
H3 Packaging, PCB, & Bring-up
• Debug core uses only a dozen signal pins
– Allowed us to wirebond a die directly to an
existing board to check debug core liveness
– Test Vdd/Gnd and V+/Gnd for shorts
– Scan-in == Scan-out
• Success of debug core liveness tests was the green light to assemble the four configurations. For each configuration:
– Package the chip (wirebonding and lead-forming)
– Design and fab a 4-layer PCB
– Assemble the PCB
Figure: Chip-on-board debug core liveness
Wirebond Configurations
Experiment          Signal/Supply Pads
Hetero. core pair   63/37
Debug core          13/87
Isolated F2F bus    62/38
Isolated F2B bus    49/51
• Overall chip has 400 pads divided into four 100-pad
experiments
– Wirebond chip differently for each experiment
– Allows for use of a 128-pin QFP package
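The "Scan-in == Scan-out" liveness check mentioned on this slide can be modeled in a few lines. This is an illustrative sketch only: it treats the scan chain as a plain shift register, and the function names, the chain length, and the assumption that flip-flops start at 0 are all hypothetical rather than taken from the H3 test scripts.

```python
def shift_through_chain(chain_length, pattern):
    """Shift `pattern` into a scan chain and return the bits observed at scan-out."""
    chain = [0] * chain_length            # flip-flops, assumed cleared in this model
    observed = []
    # Feed the pattern, then flush with zeros so every pattern bit reaches scan-out.
    for bit in list(pattern) + [0] * chain_length:
        observed.append(chain[-1])        # scan-out is the last flip-flop
        chain = [bit] + chain[:-1]        # each flip-flop takes its neighbor's value
    return observed

def chain_is_alive(chain_length, pattern):
    """True if the pattern reappears at scan-out after chain_length clock cycles."""
    observed = shift_through_chain(chain_length, pattern)
    return observed[chain_length:chain_length + len(pattern)] == list(pattern)

print(chain_is_alive(8, [1, 0, 1, 1, 0, 0, 1, 0]))
```

On real silicon a broken or shorted chain would return a stuck or shifted sequence instead of the pattern, which is what makes this a cheap first test before packaging the remaining configurations.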
H3 Packaging, PCB, & Bring-up
Test setup
• PCB connects to LPC mezzanine of Xilinx ML605
• FPGA handles memory requests from cores
– Block RAMs for L2 cache
• Host PC sends commands to FPGA via serial interface using a custom GUI
• Custom compiler for writing microbenchmarks
– For good control of instruction selection and order, without assembly programming
All signals go to both:
• Headers (to oscilloscope)
• LPC connector (to FPGA)
Figure: Fully assembled H3 PCB (Phase 1)
Layer 1 (shown): package, headers
Layers 2,3: Vdd/V+, Gnd
Layer 4 (underneath): LPC, DCAPs
H3 Packaging, PCB, & Bring-up
• Results of chip bring-up
– Identified 9 total issues
– 3 setup, 1 “feature”, 4 bugs, 1 possible bug
– See Table 3, #x
• 3 setup issues: #2, #3, #6 (fixed)
• 1 “feature” of extra I$ bus traffic: #4
(no ill effects, but may want to fix in Phase 2)
• 4 bugs (will fix in Phase 2)
– 1 bug detected post-tapeout, pre-silicon: #1
(serious, but has workaround)
– 1 class of hold-time bugs in D$: #5
(serious, fortunately top core ok with certain tags)
– 2 bugs exercised by thread migrations: #7, #8
(just annoying, have workarounds)
• 1 possible bug when migrating with CCD enabled: #9
(debug in progress)
H3 Packaging, PCB, & Bring-up
Microbenchmarks
• PRNG: pseudo-random number generator
• PRIMES: prime number generator
• ARRAY-SUM: sum elements of an array
• BUBBLE: bubble sort, list initially reverse sorted
• MIGRATE: “migrate” instr. inside loop
Microbenchmark  Static instr.  Dynamic instr.  Core    Cycles   IPC   Current (mA)  Energy (mJ)
PRNG            28             67.1M           1-wide  67.1M    1.00  8.53†         59.6
                                               2-wide  64.4M    1.04  5.78          46.5
PRIMES          28             88.2M           1-wide  100.4M   0.88  8.35†         87.3
                                               2-wide  301.8M   0.29  5.70          215.0
ARRAY-SUM       35             41.9M           1-wide  Did not finish
                                               2-wide  20.9M    2.00  6.54          17.1
BUBBLE          73             2.5M            1-wide  Did not finish
                                               2-wide  8.5M     0.29  5.24          5.57
Comments
• PRNG, 1-wide: Peak IPC achieved.
• PRNG, 2-wide: IPC>1 due to branches. IPC not much greater than 1 because only one simple/complex ALU lane, and load/store lane unused.
• PRIMES, 1-wide: OOO tolerates 14-cycle integer divide instruction well.
• PRIMES, 2-wide: Hypothesis: larger IQ and non-age-based scheduler exacerbates priority inversion, stalling retirement more.
• ARRAY-SUM, 1-wide: See Table 3, issue #5.
• ARRAY-SUM, 2-wide: Peak IPC achieved.
• BUBBLE, 1-wide: See Table 3, issue #5.
• BUBBLE, 2-wide: Early diagnosis: SQ stalls dominate other resource stalls. Hypothesis: Limited store buffer size of write-through D$ causing back-pressure. Frequent swaps imply many consecutive stores. Write-through latency to FPGA is high.
• MIGRATE: Completes 1 million consecutive thread migrations correctly. Cores at different frequencies. ~1 migration per 1K cycles.
† 1-wide core has higher current (and power) than 2-wide core because it has full scan.
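As a cross-check on the microbenchmark table, IPC is dynamic instructions divided by cycles. The sketch below recomputes it for the runs that finished, assuming the instruction and cycle counts pair up as they appear in the table (both in millions); the variable and function names are illustrative.

```python
# (benchmark, core, dynamic instructions, cycles), both in millions.
RUNS = [
    ("PRNG", "1-wide", 67.1, 67.1),
    ("PRNG", "2-wide", 67.1, 64.4),
    ("PRIMES", "1-wide", 88.2, 100.4),
    ("PRIMES", "2-wide", 88.2, 301.8),
    ("ARRAY-SUM", "2-wide", 41.9, 20.9),
    ("BUBBLE", "2-wide", 2.5, 8.5),
]

def ipc(dynamic_instructions, cycles):
    """Instructions per cycle; the units cancel as long as both use the same scale."""
    return dynamic_instructions / cycles

IPC = {(bench, core): round(ipc(instr, cycles), 2) for bench, core, instr, cycles in RUNS}

for (bench, core), value in IPC.items():
    print(f"{bench:10s} {core}  IPC = {value:.2f}")
```

The recomputed values agree with the IPC column to two decimal places, which supports reading the 2-wide PRIMES result (0.29) as a real slowdown rather than a transcription slip.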
H3 Phase 2
• Scheduled tapeout in August 2015
• Just the two-core stack
– No debug core
– No scan chains
• Design tasks
– Partition RTL for two tiers
– Implement thread migration enhancements
– Fix bugs
– Replace T2 D-caches with in-house D-caches
• T2 not workable: explicit latch instances, everywhere
• T2 no longer needed: shelved crossbar and stacked DRAM L2$
• We now have in-house D-cache from AnyCore effort
– Custom-design M1 pads and F2F bondpoints
AnyCore Overview
• FabScalar evolution
– Released FabScalar:
• Build-up cores from library of stage designs of different widths
– Next-gen FabScalar:
• “Superset Core”: single Verilog description with
parameterized widths (structure sizes already parameterized)
• AnyCore derived from “superset core”
– Keep static configurability of superset core to allow for
synthesis of different max sized AnyCore processors
– Add dynamic configurability within max size
• AnyCore instance that was fabricated:
Adaptive microarchitecture feature           Configurations
fetch/dispatch width (instructions/cycle)    1, 2, 3, 4
issue width (instructions/cycle)             3, 4, 5
physical register file & active list         64, 96, 128
load and store queues (each)                 16, 32
issue queue                                  16, 32, 48, 64
AnyCore Design and Verification
Notable Design Features
• In-house L1 cache designs with three modes
– Cache
– Scratchpad
– BIST (mode after processor reset)
• Scratchpad mode with first N rows preset to a test program, including a new
instruction that toggles a dedicated pin.
• Debug interface for direct reading/writing key pipeline structures,
scratchpads, core configuration registers, and performance counters
Applied Lessons from H3 Project
• Automatic liveness test (BIST)
– Upon applying power, clock, and reset, chip should toggle a dedicated
pin. This signals it is correctly executing the preset test program.
• Got netlist simulation working early
– Both post-synthesis (ideal clocks, estimated SDF) and post-layout
(everything realistic)
AnyCore Packaging, PCB, & Bring-up
• Applied more lessons from H3 project
– Used a socket instead of soldering package to PCB
• Possible benefit that PCB may be repurposed for other projects
• Replace defective chips
• Study variations
– Used a dedicated bypassable shunt resistor to measure current
– Narrowed PCB to LPC profile. Compatible with arbitrary Xilinx boards (ML605 and Zynq boards).
Intermediate checks, by item received:
• Packaged chips from MOSIS
– Examined chip orientation, wirebonds, and downbonds under microscope.
– Checked lead-forming to socket specification.
• Sockets from Agile
– Checked fit of package in socket under microscope.
– Vdd, V+, and Gnd checks with chip in socket.
• Bare PCB from B.B.
– Checked all connections.
• Assembled PCB from B.B.
– Repeated Vdd, V+, and Gnd checks with chip in socket.
– BIST: successful retirement of first instructions within 24 hours of receiving assembled PCB. (Toggle Pin observed April 9!)
Discussion and Questions
• Thank you
• Any comments or
questions?
Fully assembled H3 PCB
Fully assembled AnyCore PCB
The H3 project is supported by a grant from Intel. The AnyCore project is supported by NSF grant CCF-1018517.
Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do
not necessarily reflect the views of the National Science Foundation.
Backup
References
N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H.
Dwiel, S. Navada, H. H. Najaf-abadi, and E. Rotenberg. FabScalar: Composing
Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar
Template. Proceedings of the 38th IEEE/ACM International Symposium on
Computer Architecture (ISCA-38), pp. 11-22, June 2011.
E. Rotenberg, B. Dwiel, E. Forbes, Z. Zhang, R. Widialaksono, R. Basu Roy
Chowdhury, N. Tshibangu, S. Lipa, W. R. Davis, and P. D. Franzon. Rationale
for a 3D Heterogeneous Multi-core Processor. Proceedings of the 31st IEEE
International Conference on Computer Design (ICCD-31), pp. 154-168,
October 2013.
H3 Detailed Post-tapeout Timeline
Timeline - post-tapeout (Sep 2013 to Apr 2015)
• Steve: packaging, PCBs
– bare die back from MOSIS (Sep. 6, 2013)
– debug core: wirebond chip-on-board
– debug core: package chip; design, fab, and assemble PCB
– hetero. core pair: package chip; design, fab, and assemble PCB
• Elliott: chip bring-up
– develop compiler for controlled benchmark creation <<July 2013>>
– debug core: Vdd/Gnd short test, current test, scan test
– debug core: retire millions of instructions
– debug core: simple power and freq. measurements
– setup host PC and FPGA infrastructure
– hetero. core pair: first instructions retired
– hetero. core pair: synthesize full testbench to FPGA
– hetero. core pair: 10M+ instr. retired (no memory ops)
– hetero. core pair: on-going testing and debug
– hetero. core pair: breakthrough - tracked down flaky reset issue
– research: ICCD-32 - Design-Effort Alloy (DEA); ICCD-32 revisions & pres.; other
• Thomas: scan chain scripts (debug core), chip bring-up
Cool H3 Pictures
Final layout
Full test setup (host PC, power supply, ML605, etc.)
H3 Errata
Issue #1
Symptom: Instructions that depend on loads will wake up early when the load misses (reading an incorrect value from the register file/bypass).
Description: A misplaced `ifdef guarded key RTL during simulation, but the `ifdef condition was not enabled during synthesis.
Stage: post-tapeout
Workaround: Prefetch blocks such that loads will hit.
Issue #2
Symptom: Source-synchronous clock port errors in FPGA testbench during PAR/mapping.
Description: Clock signals from the cores to the FPGA were not routed to clock-capable pins of the LPC.
Stage: post-PCB-assembly
Workaround: Use a jumper from the clock header of the PCB to a clock-capable pin of the XM105 debug card attached to the HPC connector of the ML605.
Issue #3
Symptom: Non-repeatability in running single threaded test programs.
Description: High precision ammeter used to measure logic current. Current draw of running core caused the ammeter to switch shunt resistance (with a relay) immediately after reset was de-asserted. During the switch, the supply as seen by the core dropped below threshold voltage, causing metastability.
Stage: post-silicon testing
Workaround: Use dedicated shunt resistor, and measure voltage drop across that resistor instead of the in-line ammeter.
H3 Errata
Issue #4
Symptom: Repeated I-cache misses to the same block.
Description:
- Frequently-executed and mostly-taken branch in the last slot of a cache block.
- A BTB update occurs when the branch is being fetched (Fetch-1 stage), and the update is for this branch. Thus, there is a BTB write and a BTB read to the same entry, that of the branch. The BTB read indicates a miss due to the concurrent read and write. This causes Fetch-1 to fetch the next sequential block, generating an I-cache miss.
- The Fetch-1 stage is redirected to the branch’s taken target in the next cycle, when the branch is predecoded in the Fetch-2 stage.
- Meanwhile, the I-cache miss request for the sequential block is not canceled. Further, the retrieved block is dropped by the core if the Fetch-1 stage is fetching instructions.
- End result: The same miss (of the sequential block) is generated repeatedly.
Stage: post-silicon testing
Workaround: This issue does not impact performance, but does make it difficult to measure the number of actual I-cache misses. It is possible to add NOPs to ensure frequently-executed and mostly-taken branches are not in the last slot of a cache block.
H3 Errata
Issue #5
Symptom: Cache-missed load never retires.
Description: After block is retrieved from memory, netlist simulation shows the tag not being written to the correct set (set index goes to x’s), owing to a hold-time violation in the set index for the tag fill. The core replays a cache-missed load until it hits. After the MHSR performs the flawed line-fill, the replayed load will miss again, generating another miss request for the same block. This repeats indefinitely.
Stage: post-silicon testing
Workaround:
Top Core: Hold-time violation on fill path seems to affect only the tag fill, not the data fill. Further, it appears that uninitialized tags in the tag SRAM are, by chance, 0. Thus, limiting load and store addresses to have tags of 0 masks the flawed tag fill. The workaround is not guaranteed but does seem very stable.
Bottom Core: Not only does the load’s tag need to be 0, but we also had to lower Vdd to slow circuit paths. With these two measures, the replayed load hits and retires. However, the replayed load does not appear to get the correct data, presumably due to the data fill path being compromised (conjecture).
H3 Errata
Issue #6
Symptom: Inability to read/write TRF registers.
Description: Attempted read/write of TRF registers in a single-thread experiment. The reset of the F2F controller is fed through clock-synchronizing flip-flop pairs, but the clock of the F2F controller was not running (since it was a single thread experiment).
Stage: post-silicon testing
Workaround: Ensure F2F controller is clocked when either core is clocked, even when migrations are not being tested.
Issue #7
Symptom: Execution deadlock after a thread migration.
Description: The BTB is implemented in SRAM, including its valid bits, so valid bits are not reset after a migration. This caused BTB hits on non-control instructions. False BTB hits are detected and recovered in Fetch-2 stage, except CTIQ is still pushed. Those instructions would allocate entries in the CTIQ, but during retirement, the CTIQ entry was not deallocated. The CTIQ would fill, and fetch would stall indefinitely.
Stage: post-silicon testing
Workaround: Carefully craft test programs such that instruction addresses do not overlap. Hard reset tends to work, but not guaranteed.
H3 Errata
Issue #8
Symptom: Repeated thread migrations only work for fewer than 33 migrations.
Description: The MIGRATE instruction is in a loop. If the loop-ending branch is fetched before the MIGRATE instruction retires, then a CTIQ entry will be allocated for the branch. However, after the migration, the CTIQ is not reset, so the allocated entries are not deallocated, eventually filling the CTIQ, blocking forward progress.
Stage: post-silicon testing
Workaround: Add instructions (NOPs if necessary) after a MIGRATE instruction to guarantee no branches can be fetched before the MIGRATE retires.
Issue #9
Symptom: Unable to repeatedly migrate from core-to-core when CCD is enabled.
Description: Debugging in progress
Stage: post-silicon testing
Workaround: Unknown
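The CTIQ leak behind issue #8 can be illustrated with a toy model. This is a sketch under stated assumptions, not the actual RTL behavior: it assumes one CTIQ entry leaks per migration and that the CTIQ holds 32 entries. The capacity is a guess chosen to match the "fewer than 33 migrations" symptom and is not stated in the slides, and the function and parameter names are hypothetical.

```python
def migrations_before_deadlock(ctiq_capacity, leak_per_migration=1, limit=1000):
    """Count migrations completed before the CTIQ fills and fetch stalls."""
    occupied = 0      # CTIQ entries allocated and never deallocated
    completed = 0
    while completed < limit:
        if occupied + leak_per_migration > ctiq_capacity:
            break     # no free entry for the loop-ending branch: fetch stalls
        occupied += leak_per_migration
        completed += 1
    return completed

# Assumed 32-entry CTIQ, one leaked entry per migration.
print(migrations_before_deadlock(32))
# With the NOP workaround no branch is fetched early, so nothing leaks.
print(migrations_before_deadlock(32, leak_per_migration=0))
```

In this model the workaround corresponds to `leak_per_migration=0`, which lets the loop run to the iteration limit, consistent with the one million consecutive migrations reported for the MIGRATE microbenchmark.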
Ammeter Shunt Resistor Switching
• High precision meter still uses shunt
resistor
– Our meter switches shunt resistor using relay
– The relay causes a lag between changes in core current draw,
and the ability of the ammeter to react
Figure: Power Supply, DCM, and Core, with events 1 to 4 marked
1. FPGA reset pressed (resets core, which then draws near zero current).
2. Ammeter switches to higher resistance shunt resistor
3. FPGA reset released (core starts executing, increasing current draw).
4. Ammeter switches back to lower resistance shunt resistor
Ammeter Burden Voltage
• Using a discrete, fixed value, resistor to measure
current also has trade-offs
– Increasing resistance gives more accurate measurements
of voltage drop (and hence current)
– However, that same voltage drop also lowers the voltage
as seen by the core
– This is why the ammeter switches shunt resistances in the
first place
Figure: Power Supply, DVM, and Core, with the point labeled “Voltage regulated here”
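The burden-voltage trade-off above follows directly from Ohm's law: the drop across a series shunt is I × R, and the core sees the supply voltage minus that drop. The sketch below uses hypothetical numbers: the 1.2 V supply and the 1 Ω and 10 Ω shunt values are assumptions for illustration, and only the 8.53 mA current is borrowed from the microbenchmark slide.

```python
def shunt_drop(core_amps, shunt_ohms):
    """Voltage the DVM measures across the shunt (Ohm's law)."""
    return core_amps * shunt_ohms

def core_voltage(supply_volts, core_amps, shunt_ohms):
    """Voltage seen by the core after the drop across the series shunt."""
    return supply_volts - shunt_drop(core_amps, shunt_ohms)

SUPPLY_VOLTS = 1.2        # assumed supply setting, not from the slides
CORE_AMPS = 8.53e-3       # 8.53 mA, as measured for the 1-wide core running PRNG

for shunt_ohms in (1.0, 10.0):   # hypothetical shunt values
    drop = shunt_drop(CORE_AMPS, shunt_ohms)
    seen = core_voltage(SUPPLY_VOLTS, CORE_AMPS, shunt_ohms)
    print(f"{shunt_ohms:5.1f} ohm: DVM reads {drop * 1e3:.2f} mV, core sees {seen:.4f} V")
```

A larger shunt gives the DVM a bigger signal to measure but takes proportionally more voltage away from the core, which is the tension the slide describes.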
Power Supply Sense Input
• You could bump the voltage of the power supply up so that
the voltage as seen by the core is within spec
• But then during reset, the core current drops to near zero, and
the voltage as seen by the core is roughly the same as the
bumped up output of the power supply
• A more fully featured power supply, with a sense input, can
help this situation… regulation is at the sense node, not the
output node of the power supply
Figure: Power Supply with Sense input, DVM, and Core, with the point labeled “Voltage regulated here”
Custom Compiler
• Needed a way to explicitly control all aspects of
emitted instructions
• Language syntax is assembly with a few higher-level
features
– Registers and memory locations can be named
– if statements and while loops
– Arithmetic operators, assignments, address-of
• Syntax allows the ability to place code/data at arbitrary
memory locations
– Including non-contiguous locations
• The compiler also emits to our checkpoint file format
• Written in flex/bison
Custom Compiler Example
% example program that
% sums the values of an array
mem (0x00400000) {
   ii:    $r1
   addr:  $r2
   cond:  $r4
   val:   $r5
   total: $r6

   addr = @data
   total = #0
   ii = #0
   cond = ii < #4
   while (cond) {
      lw val, #0[addr]
      addr = addr + #4
      total = total + val
      ii = ii + #1
      cond = ii < #4
   }
   addr = @result
   sw total, #0[addr]
}

% example program array data
mem (0x00100000) {
   data:
      !0x0f0f0f0f
      !0xabcd1234
      !0x00000001
      !0xdeadbeef
   result:
      !0x00000000
}
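To make the example program's intent concrete, here is a rough Python equivalent of the array-sum loop. It is a sketch with assumptions that the slide does not state: registers are taken to be 32 bits wide and to wrap on overflow, and the data words are taken to sit at consecutive 4-byte addresses starting at 0x00100000 with `result` immediately after them. The names `memory`, `MASK32`, and `run_array_sum` are illustrative.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers that wrap on overflow

# Assumed layout of the data block: four words followed by the result word.
memory = {
    0x00100000: 0x0F0F0F0F,
    0x00100004: 0xABCD1234,
    0x00100008: 0x00000001,
    0x0010000C: 0xDEADBEEF,
    0x00100010: 0x00000000,  # result
}

def run_array_sum(mem, data_addr=0x00100000, result_addr=0x00100010, count=4):
    """Mirror the while loop: load, advance the address, accumulate, count."""
    addr, total, ii = data_addr, 0, 0
    while ii < count:                     # cond = ii < #4
        val = mem[addr]                   # lw val, #0[addr]
        addr += 4                         # addr = addr + #4
        total = (total + val) & MASK32    # total = total + val
        ii += 1                           # ii = ii + #1
    mem[result_addr] = total              # sw total, #0[addr]
    return total

print(hex(run_array_sum(dict(memory))))
```

Each statement in the custom language maps to one emitted instruction, which is the control over instruction selection and order that the compiler was built to provide.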
Intel 80386 Errata (it’s not just us!)
• Hardware Peculiarity: Unless pin F13 of the 80386 is connected to
the +5V power supply, the 80386 never terminates a memory cycle,
halting the processor.
– Datasheet indicates pin F13 is NC with a note that “Pins identified as
‘NC’ should remain completely unconnected.”
• Successive Floating-Point Instructions: If two floating-point
instructions are executed close together, the 80386 may force the
coprocessor to start the second one too soon if the first one did not
require any memory operands.
• Misaligned Floating-Point Instructions: If 80287 and/or 80387
instructions are not word-aligned, the 80386 passes the wrong
instruction to the coprocessor, causing unpredictable behavior.
• Self-test: The self-test feature does not work on the A1 stepping of
the 80386.
• Double Page Faults: The bug that appeared in the B0 stepping
regarding page faults that occur during page faults has been made a
permanent feature of the 80386…
Turley, James L., Advanced 80386 Programming Techniques, McGraw-Hill, Berkeley, CA, 1988.
AnyCore Detailed Pre-tapeout Timeline
Timeline - Design and Verification (Jan 2013 to Dec 2014)
• Rangeen: 6-wide core
– Add microarchitectural adaptivity to superset
– Add dynamic reconfiguration control
– Beef up testbench to support dynamic reconfiguration tests
– Add debug facility and configuration registers
– RTL functional and performance verification and debug
– synthesis, fine grain clock gating and scan chain insertion
– RTL feature freeze
– post-synthesis verification (synthesized netlist simulation)
– post-layout verification (P&R netlist simulation)
– UPF based power aware design and synthesis flow
– UPF based power aware netlist simulation and power analysis
• Rangeen: 4-wide core
– Add coarse grain clock gates - partition level
– RTL functional and performance verification and debug
– synthesis and scan chain insertion
– post-synthesis verification (synthesized netlist simulation)
– post-layout verification (P&R netlist simulation)
– UPF based power aware design and synthesis flow
– UPF based power aware netlist simulation and power analysis
• Rangeen: L1 caches
– Non blocking write through data cache
– Instruction cache
– Miss handling and off chip communication machinery
– integrate and debug
– Add scratch mode and BIST to caches
– Change cache sizes and off-chip bus widths
• Anil: perf. counters, debug access and verification
– Add perf counters
– Read and write PRF via debug bus
– Read AMT via debug bus
– Verify perf counter read and reset
– Verify debug access to PRF and AMT
– Post synthesis verification of debug features
– Verify read/write access to caches via debug bus
• Rangeen: physical design
– Modify H3 flow to suit AnyCore requirements
– physical design (floorplan, P&R, DRC, LVS)
– missed tapeout (May 23, 2014): couldn't close on timing - design too dense
– tapeout (August 22, 2014)
• Vinesh: scan chain verification - post layout
– Script to generate bit sequence from executable
– Add scan support in testbench
– Verify basic scan functionality
• Other timeline bar labels: Flow, Final, research
AnyCore Physical Design Details
Physical design data
Technology              IBM 8RF (130nm)
Dimensions              5 mm x 5 mm
Area                    25 mm2
Pads (signal, power)    100 (79, 21)
Transistors             3.4 million
Cells                   1.5 million
Nets                    7.6 million
AnyCore Detailed Post-tapeout Timeline
Timeline - Post Tapeout (Oct 2014 to Apr 2015)
• Rangeen/Anil: packaging
– Package bonding diagram
– Socket selection: dictates lead trim
– Packaging specifics: lead trim, downbond, fill
– Packaging validation after receiving packaged parts
• Anil: PCB design
– PCB planning and knowledge transfer
– Socket footprint creation - consult with PCB manufacturer for design rule
– Rev 1A - minimum required components and headers
– Rev 1B - debug features, silk screen markings, headers, power measurement
– Design review and detailed connectivity check with FPGA and headers
– submit final PCB design to Better Boards (April 1)
– test bare board at Better Boards (April 7)
– pick up assembled boards from Better Boards (April 8)
• Rangeen/Anil/Elliott: chip bring-up
– insert chip in socket, check Vdd/V+/Gnd, attach to FPGA, BIST success! (April 9)
Cool AnyCore Pictures
Final layout
Floorplan
AnyCore Pipeline
BIST Program
Address  Instruction          Address  Instruction
0x00     addi r1, r0, #0      0x80     addi r4, r0, #0
0x08     addi r2, r0, #0      0x88     ld r4, r3(#0)
0x10     addi r3, r0, #0      0x90     addi r1, r1, #5
0x18     addi r4, r0, #0      0x98     addi r2, r2, #10
0x20     toggle               0xa0     toggle
0x28     nop                  0xa8     addi r2, r2, #10
0x30     nop                  0xb0     addi r4, r4, #2
0x38     nop                  0xb8     addi r3, r3, #4
0x40     st r3(#0), r4        0xc0     addi r2, r2, #10
0x48     addi r2, r2, #10     0xc8     addi r1, r1, #5
0x50     addi r1, r1, #5      0xd0     addi r1, r1, #5
0x58     addi r2, r2, #10     0xd8     jmp 0x40
0x60     toggle               0xe0     nop
0x68     addi r1, r1, #5      0xe8     nop
0x70     addi r2, r2, #10     0xf0     nop
0x78     addi r1, r1, #5      0xf8     nop
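To see what the BIST program does on each pass, the listing can be run through a small interpreter. This is an illustrative sketch, not the AnyCore ISA definition: it assumes 8-byte instruction spacing (as the addresses suggest), that `st r3(#0), r4` stores r4 to the address in r3, that `ld r4, r3(#0)` loads r4 from that address, and that `toggle` flips the dedicated pin once. The names `PROGRAM` and `run_bist` are hypothetical.

```python
import re

# The 32 instructions of the listing, in address order (0x00 to 0xf8, 8 bytes apart).
PROGRAM = """
addi r1, r0, #0
addi r2, r0, #0
addi r3, r0, #0
addi r4, r0, #0
toggle
nop
nop
nop
st r3(#0), r4
addi r2, r2, #10
addi r1, r1, #5
addi r2, r2, #10
toggle
addi r1, r1, #5
addi r2, r2, #10
addi r1, r1, #5
addi r4, r0, #0
ld r4, r3(#0)
addi r1, r1, #5
addi r2, r2, #10
toggle
addi r2, r2, #10
addi r4, r4, #2
addi r3, r3, #4
addi r2, r2, #10
addi r1, r1, #5
addi r1, r1, #5
jmp 0x40
nop
nop
nop
nop
""".strip().splitlines()

def run_bist(max_instructions):
    """Execute the listing for a fixed number of instructions; return (toggles, registers)."""
    regs = {f"r{i}": 0 for i in range(5)}
    mem, toggles, pc = {}, 0, 0
    for _ in range(max_instructions):
        op, _, rest = PROGRAM[pc // 8].partition(" ")
        args = re.findall(r"r\d|#\d+|0x[0-9a-f]+", rest)
        pc += 8
        if op == "addi":                  # addi dst, src, #imm
            dst, src, imm = args
            regs[dst] = regs[src] + int(imm[1:])
        elif op == "st":                  # assumed: st base(#offset), source
            base, offset, src = args
            mem[regs[base] + int(offset[1:])] = regs[src]
        elif op == "ld":                  # assumed: ld destination, base(#offset)
            dst, base, offset = args
            regs[dst] = mem[regs[base] + int(offset[1:])]
        elif op == "jmp":
            pc = int(args[0], 16)
        elif op == "toggle":
            toggles += 1
        regs["r0"] = 0                    # r0 is treated as hardwired zero
    return toggles, regs

toggles, regs = run_bist(28)              # initialization plus the first pass through the loop
print(toggles, regs)
```

Under these assumptions the loop from 0x40 to 0xd8 toggles the pin twice per pass while exercising the ALU, a store, and a dependent load, so a steadily toggling pin indicates that fetch, execute, memory, and the taken jump are all working.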
Debug Bus
Figure: debug bus block diagram
• Blocks reachable over the debug bus: Caches / AMT / PRF, Configuration Registers, Performance Counters, Debug Registers (PC, queue pointers etc.)
• Signals: Write Data / Wr En, Read/Write Addr, Read Data