Experiences with Two FabScalar-based Chips
Elliott Forbes, Rangeen Basu Roy Chowdhury, Brandon Dwiel, Anil Kannepalli, Vinesh Srinivasan, Zhenqian Zhang, Randy Widialaksono, Thomas Belanger, Steve Lipa, Eric Rotenberg, W. Rhett Davis, Paul D. Franzon
Electrical and Computer Engineering – North Carolina State University

Introduction
• FabScalar
  – Synthesizable and parameterized RTL for OOO superscalar cores
    • Fetch and issue widths, structure sizes
  – No memory hierarchy or “uncore” (until next release)
• Two FabScalar-based research projects
  – H3 (“Heterogeneity in 3D”)
    • Two cores with different microarchitectures
    • Hardware support for thread migration
  – AnyCore
    • One core with reconfigurable microarchitecture

Goals
• Technical
  – Explore adaptivity: Ability to adjust microarchitecture to current instruction-level behavior
    • Migrate program execution to more suitable core (H3)
    • Reconfigure core (AnyCore)
• Non-technical
  – Fulfill original vision of FabScalar
    • Streamline development of single-ISA heterogeneous multi-core processors
  – Experience realities of fabricating designs
  – Have fun building stuff

H3 Overview
• Two stacked asymmetric cores
• Fast Thread Migration (FTM)
  – Bulk swap of arch. register state
• Cache-Core Decoupling (CCD)
  – Cores may switch L1 caches at thread migrations
• Face-to-face 3D bonding provides dense low-latency interconnect

Parameter | “top” core | “bottom” core
fetch width | 2 | 1
issue width | 3 (alu, br, ld-st) | 3
IQ | 32 | 16
LQ/SQ | 16/16 | 16/16
ROB | 64 | 32
PRF | 96 | 64
L1 I$ | 4KB DM | 4KB DM
L1 D$ | 8KB 4-way | 8KB 4-way

• Two phases
  – Phase 1: 2D IC (completed testing in June 2015)
    • Test cores, caches, compiled memories, migration logic, etc.
  – Phase 2: 3D IC (August 2015 tapeout)
    • Demonstrate stacked out-of-order cores, benefits of heterogeneity, etc.

H3 Design
Timeline - Design and Verification
[Gantt chart: January 2012 through June 2013, one group of task bars per person]

Brandon: cores
• generate dual hetero. cores with FabScalar
• dust off in-order scalar core (cut from tapeout)
• in-order core checked-in to SVN repo
• FabScalar-generated cores checked-in to SVN repo
• L1 instruction cache
• modify Fetch-1 stage for synch. R/W RAMs for I$, BP, BTB
• I/O (4 buses: core 1 req+resp, core 2 req+resp), serializer/deserializer
• cache-core decoupling (CCD) for both I- and D-caches
• synthesis and scan chain insertion
• RTL freeze on "allcores" (Sep. 19, 2012)
• delivered synthesized netlist with scan chains inserted
• regenerate memories for L1 I$, L1 D$, BP, BTB (in IBM 8RF)
• post-RTL-freeze verification (RTL simulation)
• post-tapeout verification (P&R netlist simulation)
• hetero. multi-core research (C++ simulator, power models, etc.)

Rangeen: L1 data cache
• study OpenSparc T2 L1 data cache
• integrate T2 D$ into in-order core for early design and testing
• integrated in-order core + D$ checked-in to SVN repo
• misc. mods to T2 D$ (remove TLB, reset strategy, etc.)
• redesign per-thread MHSRs for single thread miss-under-miss (MLP)
• modify core's load/store lane for 2-cycle MEM stage
• integrated OOO cores + D$ checked-in to SVN repo
• debug

Elliott: core-side FTM, perf. counters, debug core
• global migration
• perf. counters
• local migration
• integrate with face-to-face bus controller
• debug core: start debug core from 2-wide core
• debug core: replace L1 I/D caches with I/D scratchpads
• debug core: replace memory bus with scan I/O

Zhenqian: face-to-face bus for FTM

Randy: flow, physical design
• missed tapeout: Tezzaron 3D run deferred indefinitely
• download & unpack IBM 8RF PDK, ARM IP (std cells, pads, mem)
• physical design (floorplan, P&R, DRC, LVS, power integrity)
• missed tapeout: couldn't close on DRC, LVS
• tapeout (May 20, 2013)

Callouts on the chart:
• 9 months til RTL freeze
• Most effort on caches, buses, new features.
• FabScalar saved effort. Two cores generated with FabScalar.
• I-cache: in-house
• Modify Fetch-1 for synch. R/W compiled memories
• Chip I/O (mem. buses, serializer/deserializer)
• research (culminated in data for ICCD-31 paper)
• D-cache: retool and integrate OpenSparc-T2 L1 D$.
  – Original plan (canceled): leverage T2’s 8 core x 8 bank crossbar and L2 $ implemented in stacked DRAM.
• New features: CCD, perf. counters, FTM
• Physical design: 6 mo. phys. design

physical design data (phase 1)
Technology | IBM 8RF (130 nm)
Dimensions | 5.25 mm x 5.25 mm
Area | 27.6 mm²
Transistors | 14.6 Million
Cells | 1.1 Million
Nets | 721 Thousand
Memory macros | 56
Clock domains | 10
Pads | 400 (100 for each of four experiments)

tool execution time
Encounter PAR | 9 hours
Calibre DRC | 4.5 hours
Calibre LVS | 2 hours

H3 Design
Mitigating risk.
Dedicated memory buses for the two cores
• Avoid a potential single point of failure
Parameterized memory bus width
• Reduce schedule risk: Early pad planning is important but fluid.
Full scan in 1-wide core
• Observability/controllability of at least one core with caches
• 2-wide core doesn’t have scan overhead
Debug Core
Rationale:
• Test a “pure” FabScalar core
• Plan B, in case two-core-stack doesn’t work
• Eliminate risky aspects
• Enhance testability/debuggability
Core configuration:
• Same configuration as 2-wide core
Key features:
• I-cache and D-cache replaced with synthesized I and D scratchpads
• No compiled memories
• No complex caches
• Full scan
• Observability/controllability for debug
• No memory buses: Scratchpads preloaded/examined via scan chains
[Figure: Die photo + floorplan]

H3 Verification
• RTL verified using SPEC2K SimPoints
  – In retrospect, should have also used microbenchmarks
• Lesson: Budget enough time for netlist verification
  – Major effort to set up netlist simulation
    • Testbench and debug more complicated (everything blasted into individual nets)
    • SDF annotation requires experience
    • Most issues caused by testbench and SDF problems
  – Found serious, but not fatal, bug just after tapeout
    • A difference between RTL and netlist caused by misplaced `ifdef
    • `ifdef SIM guards instrumentation in the RTL. Thus, SIM is defined in testbench but not in synthesis script. The problem is that a small real code fragment was also mistakenly guarded by it.
    • Lessons: (1) Consolidate all instrumentation in testbench (none in RTL). (2) Do netlist verification, because netlist may not equal RTL.
  – Netlist simulation also would have alerted us to hold-time violations in D$
    • OpenSparc T2 D$ is a heavily latch-based industry design
    • Problem encountered in chip bring-up, diagnosed with netlist simulation

H3 Packaging, PCB, & Bring-up
• Debug core uses only a dozen signal pins
  – Allowed us to wirebond a die directly to an existing board to check debug core liveness
  – Test Vdd/Gnd and V+/Gnd for shorts
  – Scan-in == Scan-out
• Success of debug core liveness tests was the green light to assemble the four configurations. For each configuration:
  – Package the chip (wirebonding and lead-forming)
  – Design and fab a 4-layer PCB
  – Assemble the PCB
[Figure: Chip-on-board debug core liveness]

Wirebond Configurations
Experiment | Signal/Supply Pads
Hetero. core pair | 63/37
Debug core | 13/87
Isolated F2F bus | 62/38
Isolated F2B bus | 49/51

• Overall chip has 400 pads divided into four 100-pad experiments
  – Wirebond chip differently for each experiment
  – Allows for use of a 128-pin QFP package

H3 Packaging, PCB, & Bring-up
Test setup
• PCB connects to LPC mezzanine of Xilinx ML605
• All signals go to both:
  – Headers (to oscilloscope)
  – LPC connector (to FPGA)
• FPGA handles memory requests from cores
  – Block RAMs for L2 cache
• Host PC sends commands to FPGA via serial interface using a custom GUI
• Custom compiler for writing microbenchmarks
  – For good control of instruction selection and order, without assembly programming
[Figure: Fully assembled H3 PCB (Phase 1). Layer 1 (shown): package, headers. Layers 2,3: Vdd/V+, Gnd. Layer 4 (underneath): LPC, DCAPs.]

H3 Packaging, PCB, & Bring-up
• Results of chip bring-up
  – Identified 9 total issues
  – 3 setup, 1 “feature”, 4 bugs, 1 possible bug
  – See Table 3, #x
• 3 setup issues: #2, #3, #6 (fixed)
• 1 “feature” of extra I$ bus traffic: #4 (no ill effects, but may want to fix in Phase 2)
• 4 bugs (will fix in Phase 2)
  – 1 bug detected post-tapeout, pre-silicon: #1 (serious, but has workaround)
  – 1 class of hold-time bugs in D$: #5 (serious, fortunately top core ok with certain tags)
  – 2 bugs exercised by thread migrations: #7, #8 (just annoying, have workarounds)
• 1 possible bug when migrating with CCD enabled: #9 (debug in progress)

H3 Packaging, PCB, & Bring-up
Microbenchmark | Description
PRNG | pseudo-random number generator
PRIMES | prime number generator
ARRAY-SUM | sum elements of an array
BUBBLE | bubble sort, list initially reverse sorted
MIGRATE | “migrate” instr. inside loop

Microbenchmark | Core | Static instr. | Dynamic instr. | Cycles | IPC | Current (mA) | Energy (mJ) | Comments
PRNG | 1-wide | 28 | 67.1M | 67.1M | 1.00 | 8.53† | 59.6 | Peak IPC achieved.
PRNG | 2-wide | 28 | 67.1M | 64.4M | 1.04 | 5.78 | 46.5 | IPC>1 due to branches. IPC not much greater than 1 because only one simple/complex ALU lane, and load/store lane unused.
PRIMES | 1-wide | 28 | 88.2M | 100.4M | 0.88 | 8.35† | 87.3 | OOO tolerates 14-cycle integer divide instruction well.
PRIMES | 2-wide | 28 | 88.2M | 301.8M | 0.29 | 5.70 | 215.0 | Hypothesis: larger IQ and non-age-based scheduler exacerbates priority inversion, stalling retirement more.
ARRAY-SUM | 1-wide | 35 | 41.9M | Did not finish | | | | See Table 3, issue #5.
ARRAY-SUM | 2-wide | 35 | 41.9M | 20.9M | 2.00 | 6.54 | 17.1 | Peak IPC achieved.
BUBBLE | 1-wide | 73 | 2.5M | Did not finish | | | | See Table 3, issue #5.
BUBBLE | 2-wide | 73 | 2.5M | 8.5M | 0.29 | 5.24 | 5.57 | Early diagnosis: SQ stalls dominate other resource stalls. Hypothesis: Limited store buffer size of write-through D$ causing back-pressure. Frequent swaps imply many consecutive stores. Write-through latency to FPGA is high.
MIGRATE | | | | | | | | Completes 1 million consecutive thread migrations correctly. Cores at different frequencies. ~1 migration per 1K cycles.

† 1-wide core has higher current (and power) than 2-wide core because it has full scan.
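
The IPC column of the microbenchmark results can be re-derived from the dynamic instruction and cycle counts. The sketch below does that, and also checks that energy divided by (current x cycles) is nearly constant for each core, which is what one would expect if each core ran at a fixed supply voltage and clock frequency. The row pairing is a reading of the table above, and the names `RUNS`, `ipc`, `energy_ratio`, `ipc_matches`, and `ratio_spread` are illustrative only. No supply voltage or frequency is assumed, since the slides do not state them.

```python
# Rows read from the microbenchmark table (counts in millions).
# (benchmark, core, dynamic instr. [M], cycles [M], reported IPC,
#  current [mA], energy [mJ])
RUNS = [
    ("PRNG",      "1-wide", 67.1,  67.1, 1.00, 8.53,  59.6),
    ("PRNG",      "2-wide", 67.1,  64.4, 1.04, 5.78,  46.5),
    ("PRIMES",    "1-wide", 88.2, 100.4, 0.88, 8.35,  87.3),
    ("PRIMES",    "2-wide", 88.2, 301.8, 0.29, 5.70, 215.0),
    ("ARRAY-SUM", "2-wide", 41.9,  20.9, 2.00, 6.54,  17.1),
    ("BUBBLE",    "2-wide",  2.5,   8.5, 0.29, 5.24,   5.57),
]


def ipc(dynamic_m, cycles_m):
    """Instructions per cycle from dynamic instruction and cycle counts."""
    return dynamic_m / cycles_m


def energy_ratio(current_ma, cycles_m, energy_mj):
    """Energy / (current x cycles); constant when Vdd and frequency are fixed."""
    return energy_mj / (current_ma * cycles_m)


def ipc_matches(tol=0.01):
    """True if every recomputed IPC agrees with the reported value."""
    return all(abs(ipc(d, c) - rep) <= tol for _, _, d, c, rep, _, _ in RUNS)


def ratio_spread(core):
    """Relative spread of energy_ratio across all runs on one core."""
    ratios = [energy_ratio(i, c, e) for _, k, _, c, _, i, e in RUNS if k == core]
    return (max(ratios) - min(ratios)) / min(ratios)


if __name__ == "__main__":
    print("IPC column consistent:", ipc_matches())
    for core in ("1-wide", "2-wide"):
        print(core, "energy-ratio spread:", round(ratio_spread(core), 4))
```

The two cores give different ratios, which agrees with the slide's remark that the cores ran at different frequencies.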

H3 Phase 2
• Scheduled tapeout in August 2015
• Just the two-core stack
  – No debug core
  – No scan chains
• Design tasks
  – Partition RTL for two tiers
  – Implement thread migration enhancements
  – Fix bugs
  – Replace T2 D-caches with in-house D-caches
    • T2 not workable: explicit latch instances, everywhere
    • T2 no longer needed: shelved crossbar and stacked DRAM L2$
    • We now have in-house D-cache from AnyCore effort
  – Custom-design M1 pads and F2F bondpoints

AnyCore Overview
• FabScalar evolution
  – Released FabScalar:
    • Build-up cores from library of stage designs of different widths
  – Next-gen FabScalar:
    • “Superset Core”: single Verilog description with parameterized widths (structure sizes already parameterized)
• AnyCore derived from “superset core”
  – Keep static configurability of superset core to allow for synthesis of different max sized AnyCore processors
  – Add dynamic configurability within max size
• AnyCore instance that was fabricated:

Adaptive microarchitecture feature | Configurations
fetch/dispatch width (instructions/cycle) | 1, 2, 3, 4
issue width (instructions/cycle) | 3, 4, 5
physical register file & active list | 64, 96, 128
load and store queues (each) | 16, 32
issue queue | 16, 32, 48, 64

AnyCore Design and Verification
Notable Design Features
• In-house L1 cache designs with three modes
  – Cache
  – Scratchpad
  – BIST (mode after processor reset)
    • Scratchpad mode with first N rows preset to a test program, including a new instruction that toggles a dedicated pin.
• Debug interface for direct reading/writing key pipeline structures, scratchpads, core configuration registers, and performance counters
Applied Lessons from H3 Project
• Automatic liveness test (BIST)
  – Upon applying power, clock, and reset, chip should toggle a dedicated pin.
    This signals it is correctly executing the preset test program.
• Got netlist simulation working early
  – Both post-synthesis (ideal clocks, estimated SDF) and post-layout (everything realistic)

AnyCore Packaging, PCB, & Bring-up
• Applied more lessons from H3 project
  – Used a socket instead of soldering package to PCB
    • Possible benefit that PCB may be repurposed for other projects
    • Replace defective chips
    • Study variations
  – Used a dedicated bypassable shunt resistor to measure current
  – Narrowed PCB to LPC profile. Compatible with arbitrary Xilinx boards (ML605 and Zynq boards).

Received | Intermediate checks
Packaged chips from MOSIS | Examined chip orientation, wirebonds, and downbonds under microscope. Checked lead-forming to socket specification.
Sockets from Agile | Checked fit of package in socket under microscope. Vdd, V+, and Gnd checks with chip in socket.
Bare PCB from B.B. | Checked all connections.
Assembled PCB from B.B. | Repeated Vdd, V+, and Gnd checks with chip in socket. BIST: successful retirement of first instructions within 24 hours of receiving assembled PCB. (Toggle Pin observed April 9!)

Discussion and Questions
• Thank you
• Any comments or questions?
[Figures: Fully assembled H3 PCB; Fully assembled AnyCore PCB]
The H3 project is supported by a grant from Intel. The AnyCore project is supported by NSF grant CCF-1018517. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Backup

References
N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi, and E. Rotenberg. FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template. Proceedings of the 38th IEEE/ACM International Symposium on Computer Architecture (ISCA-38), pp. 11-22, June 2011.
E. Rotenberg, B. Dwiel, E. Forbes, Z. Zhang, R. Widialaksono, R. Basu Roy Chowdhury, N. Tshibangu, S. Lipa, W. R. Davis, and P. D. Franzon. Rationale for a 3D Heterogeneous Multi-core Processor. Proceedings of the 31st IEEE International Conference on Computer Design (ICCD-31), pp. 154-168, October 2013.

H3 Detailed Pre-tapeout Timeline
[Gantt chart: Timeline - Design and Verification, January 2012 through June 2013. Same per-person tasks as listed under H3 Design, through tapeout (May 20, 2013). Callout: research (culminated in data for ICCD-31 paper).]

H3 Detailed Post-tapeout Timeline
Timeline - post-tapeout
[Gantt chart: September 2013 through April 2015]

Steve: packaging, PCBs
• bare die back from MOSIS (Sep. 6, 2013)
• debug core: wirebond chip-on-board
• debug core: package chip; design, fab, and assemble PCB
• hetero. core pair: package chip; design, fab, and assemble PCB

Elliott: chip bring-up
• develop compiler for controlled benchmark creation (July 2013)
• debug core: Vdd/Gnd short test, current test, scan test
• debug core: retire millions of instructions
• debug core: simple power and freq. measurements
• setup host PC and FPGA infrastructure
• hetero. core pair: first instructions retired
• hetero. core pair: synthesize full testbench to FPGA
• hetero. core pair: 10M+ instr. retired (no memory ops)
• hetero. core pair: on-going testing and debug
• hetero. core pair: breakthrough - tracked down flaky reset issue
• research
• research

Thomas: scan chain scripts (debug core), chip bring-up

Callouts on the chart:
• ICCD-32 - Design-Effort Alloy (DEA)
• ICCD-32 revisions & pres.
• other

Cool H3 Pictures
[Figures: Final layout; Full test setup (host PC, power supply, ML605, etc.)]

H3 Errata

#1
Symptom: Instructions that depend on loads will wake up early when the load misses (reading an incorrect value from the register file/bypass).
Description: A misplaced `ifdef guarded key RTL during simulation, but the `ifdef condition was not enabled during synthesis.
Stage: post-tapeout
Workaround: Prefetch blocks such that loads will hit.

#2
Symptom: Source-synchronous clock port errors in FPGA testbench during PAR/mapping.
Description: Clock signals from the cores to the FPGA were not routed to clock-capable pins of the LPC.
Stage: post-PCB-assembly
Workaround: Use a jumper from the clock header of the PCB to a clock-capable pin of the XM105 debug card attached to the HPC connector of the ML605.

#3
Symptom: Non-repeatability in running single threaded test programs.
Description: High precision ammeter used to measure logic current. Current draw of running core caused the ammeter to switch shunt resistance (with a relay) immediately after reset was de-asserted. During the switch, the supply as seen by the core dropped below threshold voltage, causing metastability.
Stage: post-silicon testing
Workaround: Use dedicated shunt resistor, and measure voltage drop across that resistor instead of the in-line ammeter.

#4
Symptom: Repeated I-cache misses to the same block.
Description:
- Frequently-executed and mostly-taken branch in the last slot of a cache block.
- A BTB update occurs when the branch is being fetched (Fetch-1 stage), and the update is for this branch. Thus, there is a BTB write and a BTB read to the same entry, that of the branch. The BTB read indicates a miss due to the concurrent read and write. This causes Fetch-1 to fetch the next sequential block, generating an I-cache miss.
- The Fetch-1 stage is redirected to the branch’s taken target in the next cycle, when the branch is predecoded in the Fetch-2 stage.
- Meanwhile, the I-cache miss request for the sequential block is not canceled. Further, the retrieved block is dropped by the core if the Fetch-1 stage is fetching instructions.
- End result: The same miss (of the sequential block) is generated repeatedly.
Stage: post-silicon testing
Workaround: This issue does not impact performance, but does make it difficult to measure the number of actual I-cache misses. It is possible to add NOPs to ensure frequently-executed and mostly-taken branches are not in the last slot of a cache block.

#5
Symptom: Cache-missed load never retires.
Description: After block is retrieved from memory, netlist simulation shows the tag not being written to the correct set (set index goes to x’s), owing to a hold-time violation in the set index for the tag fill. The core replays a cache-missed load until it hits. After the MHSR performs the flawed line-fill, the replayed load will miss again, generating another miss request for the same block. This repeats indefinitely.
Stage: post-silicon testing
Workaround:
Top Core: Hold-time violation on fill path seems to affect only the tag fill, not the data fill. Further, it appears that uninitialized tags in the tag SRAM are, by chance, 0. Thus, limiting load and store addresses to have tags of 0 masks the flawed tag fill. The workaround is not guaranteed but does seem very stable.
Bottom Core: Not only does the load’s tag need to be 0, but we also had to lower Vdd to slow circuit paths. With these two measures, the replayed load hits and retires. However, the replayed load does not appear to get the correct data, presumably due to the data fill path being compromised (conjecture).

#6
Symptom: Inability to read/write TRF registers.
Description: Attempted read/write of TRF registers in a single-thread experiment.
The reset of the F2F controller is fed through clock-synchronizing flip-flop pairs, but the clock of the F2F controller was not running (since it was a single thread experiment).
Stage: post-silicon testing
Workaround: Ensure F2F controller is clocked when either core is clocked, even when migrations are not being tested.

#7
Symptom: Execution deadlock after a thread migration.
Description: The BTB is implemented in SRAM, including its valid bits, so valid bits are not reset after a migration. This caused BTB hits on non-control instructions. False BTB hits are detected and recovered in Fetch-2 stage, except CTIQ is still pushed. Those instructions would allocate entries in the CTIQ, but during retirement, the CTIQ entry was not deallocated. The CTIQ would fill, and fetch would stall indefinitely.
Stage: post-silicon testing
Workaround: Carefully craft test programs such that instruction addresses do not overlap. Hard reset tends to work, but not guaranteed.

#8
Symptom: Repeated thread migrations only work for fewer than 33 migrations.
Description: The MIGRATE instruction is in a loop. If the loop-ending branch is fetched before the MIGRATE instruction retires, then a CTIQ entry will be allocated for the branch. However, after the migration, the CTIQ is not reset, so the allocated entries are not deallocated, eventually filling the CTIQ, blocking forward progress.
Stage: post-silicon testing
Workaround: Add instructions (NOPs if necessary) after a MIGRATE instruction to guarantee no branches can be fetched before the MIGRATE retires.

#9
Symptom: Unable to repeatedly migrate from core-to-core when CCD is enabled.
Description: Debugging in progress
Stage: post-silicon testing
Workaround: Unknown

Ammeter Shunt Resistor Switching
• High precision meter still uses shunt resistor
  – Our meter switches shunt resistor using relay
  – The relay causes a lag between changes in core current draw, and the ability of the ammeter to react
[Figure: Power Supply, DCM, and Core, with events 1 to 4 marked]
1. FPGA reset pressed (resets core, which then draws near zero current).
2. Ammeter switches to higher resistance shunt resistor
3. FPGA reset released (core starts executing, increasing current draw).
4. Ammeter switches back to lower resistance shunt resistor

Ammeter Burden Voltage
• Using a discrete, fixed-value resistor to measure current also has trade-offs
  – Increasing resistance gives more accurate measurements of voltage drop (and hence current)
  – However, that same voltage drop also lowers the voltage as seen by the core
  – This is why the ammeter switches shunt resistances in the first place
[Figure: Power Supply, DVM, and Core; label "Voltage regulated here"]

Power Supply Sense Input
• You could bump the voltage of the power supply up so that the voltage as seen by the core is within spec
• But then during reset, the core current drops to near zero, and the voltage as seen by the core is roughly the same as the bumped up output of the power supply
• A more fully featured power supply, with a sense input, can help this situation… regulation is at the sense node, not the output node of the power supply
[Figure: Power Supply with Sense input, DVM, and Core; label "Voltage regulated here"]

Custom Compiler
• Needed a way to explicitly control all aspects of emitted instructions
• Language syntax is assembly with a few higher-level features
  – Registers and memory locations can be named
  – if statements and while loops
  – Arithmetic operators, assignments, address-of
• Syntax allows the ability to place code/data at arbitrary memory locations
  – Including non-contiguous locations
• The compiler also emits to our checkpoint file format
• Written in flex/bison

Custom Compiler Example

    % example program that
    % sums the values of an array
    mem (0x00400000)
    {
       ii: $r1
       addr: $r2
       cond: $r4
       val: $r5
       total: $r6

       addr = @data
       total = #0
       ii = #0
       cond = ii < #4
       while (cond)
       {
          lw val, #0[addr]
          addr = addr + #4
          total = total + val
          ii = ii + #1
          cond = ii < #4
       }

       addr = @result
       sw total, #0[addr]
    }

    % example program array data
    mem (0x00100000)
    {
       data:
          !0x0f0f0f0f
          !0xabcd1234
          !0x00000001
          !0xdeadbeef
       result:
          !0x00000000
    }

Intel 80386 Errata (it’s not just us!)
• Hardware Peculiarity: Unless pin F13 of the 80386 is connected to the +5V power supply, the 80386 never terminates a memory cycle, halting the processor.
  – Datasheet indicates pin F13 is NC with a note that “Pins identified as ‘NC’ should remain completely unconnected.”
• Successive Floating-Point Instructions: If two floating-point instructions are executed close together, the 80386 may force the coprocessor to start the second one too soon if the first one did not require any memory operands.
• Misaligned Floating-Point Instructions: If 80287 and/or 80387 instructions are not word-aligned, the 80386 passes the wrong instruction to the coprocessor, causing unpredictable behavior.
• Self-test: The self-test feature does not work on the A1 stepping of the 80386.
• Double Page Faults: The bug that appeared in the B0 stepping regarding page faults that occur during page faults has been made a permanent feature of the 80386…
Turley, James L., Advanced 80386 Programming Techniques, McGraw-Hill, Berkeley, CA, 1988.
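
Returning to the Custom Compiler Example: its control flow can be cross-checked with a short Python model. This is only a sketch under stated assumptions, not the compiler's actual semantics. It assumes 32-bit registers that wrap on overflow, data words laid out contiguously 4 bytes apart from the `mem (0x00100000)` base, and `lw`/`sw` addressing of the form base plus offset. The names `build_memory` and `run_array_sum` are hypothetical.

```python
MASK32 = 0xFFFFFFFF      # assumption: 32-bit registers with wraparound
DATA_BASE = 0x00100000   # base address of the data block in the example
WORDS = [0x0F0F0F0F, 0xABCD1234, 0x00000001, 0xDEADBEEF]
RESULT_ADDR = DATA_BASE + 4 * len(WORDS)  # assumption: result follows data


def build_memory():
    """Lay out the four data words, then a zeroed result word."""
    mem = {DATA_BASE + 4 * i: word for i, word in enumerate(WORDS)}
    mem[RESULT_ADDR] = 0
    return mem


def run_array_sum(mem):
    """Mirror the example program statement by statement."""
    addr = DATA_BASE                      # addr = @data
    total = 0                             # total = #0
    ii = 0                                # ii = #0
    cond = ii < 4                         # cond = ii < #4
    while cond:
        val = mem[addr]                   # lw val, #0[addr]
        addr = (addr + 4) & MASK32        # addr = addr + #4
        total = (total + val) & MASK32    # total = total + val
        ii += 1                           # ii = ii + #1
        cond = ii < 4                     # cond = ii < #4
    mem[RESULT_ADDR] = total              # addr = @result; sw total, #0[addr]
    return total


if __name__ == "__main__":
    memory = build_memory()
    print(hex(run_array_sum(memory)))
```

Under these assumptions the four words sum to 0x9989E033 after 32-bit wraparound, and that value is written to the result word.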

AnyCore Detailed Pre-tapeout Timeline
Timeline - Design and Verification
[Gantt chart: January 2013 through December 2014]

Rangeen: 6-wide core
• Add microarchitectural adaptivity to superset
• Add dynamic reconfiguration control
• Beef up testbench to support dynamic reconfiguration tests
• Add debug facility and configuration registers
• RTL functional and performance verification and debug
• synthesis, fine grain clock gating and scan chain insertion
• RTL feature freeze
• post-synthesis verification (synthesized netlist simulation)
• post-layout verification (P&R netlist simulation)
• UPF based power aware design and synthesis flow
• UPF based power aware netlist simulation and power analysis

Rangeen: 4-wide core
• Add coarse grain clock gates - partition level
• RTL functional and performance verification and debug
• synthesis and scan chain insertion
• post-synthesis verification (synthesized netlist simulation)
• post-layout verification (P&R netlist simulation)
• UPF based power aware design and synthesis flow
• UPF based power aware netlist simulation and power analysis

Rangeen: L1 caches
• Non blocking write through data cache
• Instruction cache
• Miss handling and off chip communication machinery
• integrate and debug
• Add scratch mode and BIST to caches
• Change cache sizes and off-chip bus widths

Anil: perf. counters, debug access and verification
• Add perf counters
• Read and write PRF via debug bus
• Read AMT via debug bus
• Verify perf counter read and reset
• Verify debug access to PRF and AMT
• Post synthesis verification of debug features
• Verify read/write access to caches via debug bus

Rangeen: physical design
• Modify H3 flow to suit AnyCore requirements
• physical design (floorplan, P&R, DRC, LVS)
• missed tapeout (May 23, 2014): couldn't close on timing - design too dense
• tapeout (August 22, 2014)

Vinesh: scan chain verification - post layout
• Script to generate bit sequence from executable
• Add scan support in testbench
• Verify basic scan functionality

Callouts on the chart: Flow; Final; research

AnyCore Physical Design Details
Physical design data
Technology | IBM 8RF (130nm)
Dimensions | 5 mm x 5 mm
Area | 25 mm²
Pads (signal, power) | 100 (79, 21)
Transistors | 3.4 million
Cells | 1.5 million
Nets | 7.6 million

AnyCore Detailed Post-tapeout Timeline
Timeline - Post Tapeout
[Gantt chart: October 2014 through April 2015]

Rangeen/Anil: packaging
• Package bonding diagram
• Socket selection: dictates lead trim
• Packaging specifics: lead trim, downbond, fill
• Packaging validation after receiving packaged parts

Anil: PCB design
• PCB planning and knowledge transfer
• Socket footprint creation - consult with PCB manufacturer for design rule
• Rev 1A - minimum required components and headers
• Rev 1B - debug features, silk screen markings, headers, power measurement
• Design review and detailed connectivity check with FPGA and headers
• submit final PCB design to Better Boards (April 1)
• test bare board at Better Boards (April 7)
• pick up assembled boards from Better Boards (April 8)

Rangeen/Anil/Elliott: chip bring-up
• insert chip in socket, check Vdd/V+/Gnd, attach to FPGA, BIST success! (April 9)

Cool AnyCore Pictures
[Figures: Final layout; Floorplan]

AnyCore Pipeline
[Figure: AnyCore pipeline diagram]

BIST Program
    0x00  addi r1, r0, #0
    0x08  addi r2, r0, #0
    0x10  addi r3, r0, #0
    0x18  addi r4, r0, #0
    0x20  toggle
    0x28  nop
    0x30  nop
    0x38  nop
    0x40  st r3(#0), r4
    0x48  addi r2, r2, #10
    0x50  addi r1, r1, #5
    0x58  addi r2, r2, #10
    0x60  toggle
    0x68  addi r1, r1, #5
    0x70  addi r2, r2, #10
    0x78  addi r1, r1, #5
    0x80  addi r4, r0, #0
    0x88  ld r4, r3(#0)
    0x90  addi r1, r1, #5
    0x98  addi r2, r2, #10
    0xa0  toggle
    0xa8  addi r2, r2, #10
    0xb0  addi r4, r4, #2
    0xb8  addi r3, r3, #4
    0xc0  addi r2, r2, #10
    0xc8  addi r1, r1, #5
    0xd0  addi r1, r1, #5
    0xd8  jmp 0x40
    0xe0  nop
    0xe8  nop
    0xf0  nop
    0xf8  nop

Debug Bus
[Figure: debug bus. Inputs: Write Data / Wr En, Read/Write Addr. Targets: Caches / AMT / PRF, Configuration Registers, Performance Counters, Debug Registers (PC, queue pointers etc.). Output: Read Data.]
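
As a cross-check of the BIST Program listing, the following Python sketch interprets it instruction by instruction and counts how often the dedicated pin would toggle. It is a functional model only, with several assumptions that the slides do not confirm: `r0` reads as zero, `st rA(#imm), rB` stores `rB` to address `rA + imm`, `ld rD, rA(#imm)` loads from `rA + imm`, `jmp` has no delay slot, and timing is ignored. The names `LISTING` and `run` are illustrative.

```python
STRIDE = 8  # instruction addresses in the listing advance by 8 bytes

# One tuple per listing entry, in address order starting at 0x00.
LISTING = [
    ("addi", 1, 0, 0), ("addi", 2, 0, 0), ("addi", 3, 0, 0), ("addi", 4, 0, 0),
    ("toggle",), ("nop",), ("nop",), ("nop",),
    ("st", 3, 0, 4),                      # 0x40: st r3(#0), r4
    ("addi", 2, 2, 10), ("addi", 1, 1, 5), ("addi", 2, 2, 10),
    ("toggle",),                          # 0x60
    ("addi", 1, 1, 5), ("addi", 2, 2, 10), ("addi", 1, 1, 5),
    ("addi", 4, 0, 0),                    # 0x80
    ("ld", 4, 3, 0),                      # 0x88: ld r4, r3(#0)
    ("addi", 1, 1, 5), ("addi", 2, 2, 10),
    ("toggle",),                          # 0xa0
    ("addi", 2, 2, 10), ("addi", 4, 4, 2), ("addi", 3, 3, 4),
    ("addi", 2, 2, 10), ("addi", 1, 1, 5), ("addi", 1, 1, 5),
    ("jmp", 0x40),                        # 0xd8
    ("nop",), ("nop",), ("nop",), ("nop",),
]


def run(steps):
    """Execute `steps` instructions; return (toggle count, registers, memory)."""
    regs = [0] * 32
    mem = {}
    toggles = 0
    pc = 0
    for _ in range(steps):
        op, *args = LISTING[pc // STRIDE]
        next_pc = pc + STRIDE
        if op == "addi":
            rd, rs, imm = args
            regs[rd] = regs[rs] + imm
        elif op == "st":
            base, imm, src = args
            mem[regs[base] + imm] = regs[src]
        elif op == "ld":
            rd, base, imm = args
            regs[rd] = mem[regs[base] + imm]
        elif op == "toggle":
            toggles += 1
        elif op == "jmp":
            (next_pc,) = args
        regs[0] = 0  # assumption: r0 is hardwired to zero
        pc = next_pc
    return toggles, regs, mem


if __name__ == "__main__":
    prologue, loop_body = 8, 20  # instructions before 0x40, and 0x40..0xd8
    toggles, regs, mem = run(prologue + 3 * loop_body)
    print(toggles, regs[1:5], mem)
```

In this model the pin toggles once in the prologue and twice per loop iteration, so a steadily toggling pin indicates that fetch, the ALU, the store and load path, and the jump are all working.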