Random Stuff
Centaur Technology Inc.
G. Glenn Henry

  Quick Background
  Our Security Functions
  Centaur Build Methodology
  Physical Design Example

Quick Background
  We’re Centaur Technology Inc. (Austin, TX)
    We design x86 processors
    Have been alive for 11 yrs, have shipped processors for 8.5
    We operate independently, but are owned by VIA
    We are a tiny group, but shipping millions of processors/yr
  Our processors are software & bus compatible with Intel x86
    But are unique vs. Intel & AMD (re design & target market):
      + lower cost (price)
      + lower power consumption
      + smaller chip footprint
      + unique integrated security features
      – generally, lower performance
    This fits some rapidly growing “new” markets for x86
  Parent company is VIA Technologies (Taiwan)
    They manufacture, market & sell our processor designs
    They develop all other PC platform chips (including chip sets for Intel & AMD processors), etc.

C5J (aka VIA Esther, VIA C7-M)
  First shipped 8/2005
  90nm IBM SOI technology
  400 MHz–2.0 GHz
  31.2 mm2, 26.2 M transistors
  64KB 4-way I-cache
  64KB 4-way D-cache
  128KB, 32-way exclusive L2
  P4 instructions (incl SSE2 & SSE3)
  P-M bus and new VIA “V4” bus (400-800 MHz) + 2-way SMP support
  P-M power management features+
  Exclusive security features
  Unique nanoBGA package
  Lowest power/MHz: 3.5W @ 1 GHz TDP, 20W @ 2 GHz TDP
  Die size (our die cost): 90nm VIA C7-M 31 mm2 vs. 90nm Intel Pentium M (Dothan) 84 mm2

C5J Die
  [Die photo, 6.9 mm dimension label. Labeled regions: Bus & APIC; 64 KB 4-way L1-D; 128 KB 32-way L2; fuses; DCU; Fetch, Decode & Translate; I-unit; 64 KB 4-way L1-I; x87 FP; ROM; SSE 1, 2 & 3, MMX; Br pred; Security; PLLs etc.]

Our Security Strategy
  Provide comprehensive set of data security functions
    …That are very secure
    …That are world’s fastest (for a single chip)
  These goals require that the functions
    …Be integrated tightly into the processor core
      Processor silicon & implementation is fastest hdw
      Only hdw can be “trusted” (no viruses, etc.)
    …Require no operating system support/involvement
      Available via non-privileged x86 instructions
      Hardware must manage multi-tasking considerations
  Available in all of our processors, for free
    We believe data security should be built into all processors
    It’s easy to do & small (effectively free)
    It’s our hobby

Our Security Implementation
  Generations: C5XL (shipped 1/2003), C5P (shipped 1/2004), C5J (shipped 8/2005), CN (future)
  Hardware RNG
    C5XL: hardware RNG unit
    C5P: 2 units
    C5J: can also feed entropy to hardware SHA (faster/better, high-quality, using built-in hdw hash functions)
  Encryption
    C5P: full AES (FIPS-197) standard in hdw; ECB, CBC, CFB, OFB modes in hdw; fastest in world!
    C5J: + CBC/CFB-MAC modes, + CTR mode, + unaligned support, + faster
  Secure Hash
    C5J: full SHA-1 & -256 (FIPS-180-2) standard in hdw; fastest in world!
  RSA Hdw Assist (Montgomery multiply)
    C5J
  CN (future): xxx, xxx

Centaur Hardware RNG
  [Block diagram. Recoverable labels:]
  2 duplicate RNGs in different physical areas (& rotated)
  adj DC bias
  whitener (one per RNG)
  asynch clocked
  1 byte per delivery: A, B, or both
  1-of-n bit selector
  32 byte hardware collection buffer
  up to 8-byte delivery per store request
  SSE store bus
  x86 “store-rand” instruction, status in EAX

RNG “Typical” Performance
  “Randomness” too hard to describe here, but here’s some basics…
  Key requirements for “truly random” (per Schneier)
    Unbiased statistical distribution: determined by statistics
    Unpredictability: determined by modeling
    Unreproducibility: only hardware need apply
  Many statistical tests defined & used (& argued about)
    Collections of many different statistical analyses
      FIPS-140-2: useless (4 tests, broken, 20,000 bit sample!)
      Diehard (18 tests): oriented to software RNGs, 10Mb sample
      NIST (16 tests): we think the best (much overlap with Diehard)
      Ent, etc.: everyone has one, everyone has their favorite
    Individual tests
      entropy: important & widely reported, but it’s not randomness
      chi2: heavily used, especially for huge samples, our favorite
      Maurer, etc.: everyone has their favorite
    Many different evaluation approaches
      threshold value, fixed ranges, probability analysis (p-value)
  Much analysis & interpretation needed to make sense here

RNG “Typical” Performance
  Performance & randomness vary by part; these are “typical”
  We have done extensive analysis
    Many terabytes of data
    Massive sample sizes (terabyte)
    Hundreds of chips
    Our own testbed software
  Analysis & report by external group
    www.cryptography.com/research/evaluations.html
  Here’s an embarrassingly simple summary…

  Setting              | Speed (Mb/s) | Entropy (per byte) | Random? (1 MB sample) [1] | Max sample size for random [2]
  white8               | 1.7          | 7.9999+            | Y                         | 50 MB-10 GB
  white4               | 3.4          | 7.999              | Y–N                       | 0-10 MB
  raw                  | 28–240       | 7.5-7.95           | N                         | –
  hashed raw (AES) [3] | 150–1,000    | 7.9999+            | Y                         | 1 TB up

  [1] Passes standard test collections: FIPS, NIST, Diehard
  [2] “Good” chi2 results
  [3] Many variations: SHA, random seed size, etc.

Centaur AES Encryption Features
  Full FIPS-197 implemented in hardware
    Encrypt & decrypt
    128b, 192b, & 256b keys
    128b data blocks
  Multiple operating modes in hardware
    ECB, CBC, CFB, OFB
    CBC/CFB-MAC & CTR modes
  Optional extended key generation in hardware
    For 128b key (both E & D) only
  Various “experimentation” options supported
    Round count 1-16, intermediate round results, etc.
  Accessed via new application-level x86 instructions
    No OS support needed
    Hardware provides inherent multitasking
  US export licenses in place

Centaur AES Hardware
  [Block diagram. Recoverable labels:]
  SSE load bus (16-byte blocks): key, ctrl, input 0, input 1
  Can pipeline 2 blks in ECB
  Extended Key RAM 16x16B; round key generation (shared logic); round key
  Per-block pipeline: block startup + CBC, CFB, OFB, etc.; S-box; row-shift; column mix; key add; block finish + CBC, CFB, OFB, etc.
  Forwarding paths: round fwd, blk-blk fwd
  SSE store bus: out 0, out 1
  Everything runs at processor clock speed
  0.3 mm2 total!

Centaur AES Performance
  AES instruction performance (approx.)
    128-bit key & block size
    Usual instruction timing assumptions = data in cache, no interrupts, aligned, key done, etc.
  Approximate clocks w/ 128b extended keys already loaded
    ECB, 1 block: 17 clocks
    ECB, large block count: 11.8/blk
    CBC/CFB/etc, 1 block: 37
    CBC/etc, large block count: 22.5/blk
  Additional extended key generation/load time (128b key)
    Hardware generated: 38
    Loaded from memory: 53

AES Performance
  Measured performance
    P4 = Gladman library AES, C5J = replaced routine with AES inst
    ECB mode (other modes slower, but same advantage over P4)
    Same memory size (512MB), same bus speeds (533 MHz)

  data size | 2.53-GHz P4 | 2.0-GHz C5J
  8 KB      | 0.56 Gb/s   | 21.5 Gb/s
  64 KB     | 0.56        | 19.5
  1 MB      | 0.56        | 5.45 (bus limited)
  10 MB     | 0.56        | 5.23 (bus limited)

  Another example: Gladman reports (his site) using his library (ECB)

  data size | 1.2-GHz C5P (earlier part)
  16 Kb     | 15.2 Gb/s

C5J Montgomery Multiplier Features
  Goal: speed up RSA’s modular exponentiation
    c = m^e mod n is dominated by repeated d = m x y mod(n) ops, where m, y, n are thousands of bits long!
  This multiply is “always” done using the “Montgomery Multiply” algorithm
    Uses a special number space to make d’ = a’ x b’ mod(m) much faster by eliminating the divide
    But initial & result values must be transformed to/from Montgomery number space
    In real usage, the transformation overhead is relatively small
  Our hardware directly performs “Montgomery Multiply”
    About as fast as an ordinary multiply!
    For up to 32Kb numbers!
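The Montgomery idea above can be sketched in a few lines of Python. This is a minimal illustration of the textbook reduction (REDC) with R chosen as a power of two rounded up to 32-bit words. The helper names (montgomery_setup, montmul, modexp) are hypothetical, and nothing here is meant to describe how the MontMul instruction or its microcode is actually organized.

```python
def montgomery_setup(n: int, word_bits: int = 32):
    """Precompute constants for an odd modulus n (illustrative helper)."""
    if n % 2 == 0:
        raise ValueError("Montgomery multiplication needs an odd modulus")
    # R = 2^k is a power of two larger than n, rounded up to whole words.
    k = ((n.bit_length() + word_bits - 1) // word_bits) * word_bits
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r          # n * n_prime == -1 (mod R)
    return k, r, n_prime


def montmul(a: int, b: int, n: int, k: int, n_prime: int) -> int:
    """Return a * b * R^-1 mod n using only multiplies, masks and shifts."""
    mask = (1 << k) - 1
    t = a * b
    u = ((t & mask) * n_prime) & mask       # u = (t mod R) * n' mod R
    d = (t + u * n) >> k                    # t + u*n is an exact multiple of R
    return d - n if d >= n else d           # d < 2n, so one subtraction suffices


def modexp(m: int, e: int, n: int) -> int:
    """Compute m^e mod n by square-and-multiply in Montgomery form."""
    k, r, n_prime = montgomery_setup(n)
    r2 = (r * r) % n                        # the only ordinary reduction, done once
    x = montmul(m % n, r2, n, k, n_prime)   # transform in: m * R mod n
    acc = montmul(1, r2, n, k, n_prime)     # 1 in Montgomery form: R mod n
    for bit in bin(e)[2:]:
        acc = montmul(acc, acc, n, k, n_prime)
        if bit == "1":
            acc = montmul(acc, x, n, k, n_prime)
    return montmul(acc, 1, n, k, n_prime)   # transform out: multiply by R^-1


print(modexp(4, 13, 497))
```

The inner loop contains no division or general modular reduction. The transformation in and out costs only a couple of extra montmul calls, which is why the overhead stays small relative to the thousands of multiplies in a large exponentiation.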
  New application-level x86 MontMul instruction

Centaur Montgomery Multiplier
  [Block diagram. Recoverable labels:]
  SSE load bus and SSE store bus (16-byte blocks)
  Operands A[j], B[i], M[j], T[j], T[j-1], U; 32 temp regs; 32-bit data paths
  Two 32 x 32 multipliers (hack of existing multipliers) with 64-bit products
  Adders producing Hi 33b, Bits 64:32 and Bits 31:0
  Ucode sequences loads & stores
  32b x 32b mod(32b) = 4 clks (2 clk pipelined)
  Usable with any size data (256 to 32Kb, 128b steps)

Centaur MontMul Performance
  Compared to GMP library
  Perform c = m^e mod n (m, e, n chosen randomly)
  An example (speeds vary slightly based on values)
  Note: this is most of RSA time, but not the whole thing
  Same hardware as for AES chart

  mod size (bits) | 2.53-GHz P4 | 2.0-GHz C5J
  512             | 340 exp/s   | 1800 exp/s
  1024            | 50          | 243
  1536            | 15.6        | 78
  2048            | 7.1         | 35

Centaur SHA Features
  FIPS-180-2 completely implemented in hardware
    SHA-1 (160-bit result)
    SHA-256 (256-bit result)
  Instruction timing
    SHA-1: 251 clks
    SHA-256: 262
    where n is the number of 64B blocks to be compressed
  Measured performance (Gb/s)
    Same hardware as for AES chart, GPL SHA SW (Devine)

  data size   | P4 SHA-1 | P4 SHA-256 | C5J SHA-1 | C5J SHA-256
  10 B        | 0.07     | 0.04       | 0.38      | 0.35
  100 B       | 0.43     | 0.24       | 2.41      | 2.24
  1,000 B     | 0.59     | 0.33       | 3.81      | 3.60
  1,000,000 B | 0.62     | 0.34       | 2.97      | 2.97 (bus limited)

  (P4 = 2.53-GHz P4, C5J = 2.0-GHz C5J)

C5J SHA Hardware
  [Block diagram. Recoverable labels:]
  SSE load bus: initial digest (160b), next 64b data
  Accumulating digest regs (5)
  Data scheduler (16 x 32b regs)
  Function generators
  5-way add
  Final sha-256 add
  SSE store bus
  SHA-1: 2 clks/32b rnd
  SHA-256: 3 clks/round

Build Process

The Centaur Process
  [Flow diagram. Recoverable labels, in extraction order:]
  Manufacturing Engineering: test vectors, test programs, test fixtures, packages, FA, silicon debug, qual, MHz calibration, etc. (arrows: mfg requirements, timing data, feedback to process)
  Timing Process (full chip): Power? Bugs? MHz?
  Design Process: arch, logic, circuit, layout, technology
  The Processor Source: verilog, schematics, layout, global wires, microcode, models (arrow: bugs)
  Physical Build & Tapeout Process: Mask Data, Foundry, Silicon (arrows: “release to manufacturing”, tech data & support, feedback to process)
  Design Verification (full chip): Compatible? CPI? Power? MHz? (arrow: fails)
  VIA Manufacturing, Marketing & Sales: hdw & sw compatibility, benchmarks, power data, supported hdw, etc. (arrow: requirements)
  System Verification

Centaur Build Methodology
  Our challenges!
    Complex logic with lots of architectural interconnections
    2-GHz & aggressive power/size objectives
    Relatively few designers (30 logic & circuit)
    Strong schedule pressure (must do it fast)
    Industry tools not sufficient (oriented to APR methodology)
  Our basic approach
    Hundreds of top-level stand-alone “blocks”
      Allows parallel development of “one-person” blocks
      Facilitates fast “build” time (chip assembly, timing, etc.)
      Facilitates use of optimum process for particular logic
    Hook blocks together with top-level routing, clocks, etc.
      Significant “content” added in top-level build
    Full-chip timing with fast iterations
    Fast full-chip build iterations
    Develop our own tools & methodology to accomplish above

Centaur Chip Physical Build Process
  processor.v
    Defines the top-level blocks & the connecting global wiring
  APR blocks
    Verilog sources for each physical APR (control logic)
  datapath stacks
    Verilog source (special format) for each physical datapath stack
    Verilog for control logic to be placed in buffer section
  circuit elements
    Schematics for each physical custom element/block
    Plus timing models, size models, RTL behaviorals, etc.
  processor.mc
    On-chip microcode
  [Flow diagram. Recoverable labels: global wiring definitions; verilog-to-layout APR flow (standard cell library); verilog-to-layout stack flow (stack element library); custom layout flow (custom blocks); microcode flow (ROMs); full-chip integration & build process: routing, RC repeaters, clock tree, power/grd, via add, cap fill]

C5J Die
  62 Full Custom Blocks (299 instances): 3.12 mm2, plus 4.82 mm2 (I/O); 1.32 M xistors
  20 APR Blocks: 2.32 mm2; 5.89 m routing; 1.18 M xistors
  63 Datapath Stacks: 6.62 mm2; 12.38 m routing; 3.38 M xistors
  10 Bit-Cell Arrays (18 instances): 8.02 mm2; 20.02 M xistors
  2 ROMs: 0.39 mm2; 0.48 M xistors
  Global Wiring interconnecting all top-level blocks: 21,512 nets; 22.73 m routing; perf optimized routing widths & spacing
  Global RC Repeaters: automatic insertion tool; 3,500 x 7 bfrs inserted
  Clock Distribution Network: hand-drawn; 49 top-level elements/395 nets
  Power/ground Grids: both hand-drawn & automatic
  On-die Decoupling Caps: automatic insertion tool
  I/O Drivers (on the die edges)

Underlying Source Statistics
  Verilog lines as written (small)
    (no behaviorals, no comments, no clocks, no “top” chip)
    APR logic: 112K lines, 129K cells
    Stack logic: 41K lines, 172K cells
    Note: this is “single instance” as written; much of this gets instantiated multiple times
  Schematic “pages” as written (large)
    Primitive (inv, nand2, nor2, etc.): 110
    Standard cells: 712
    Datapath elements: 1308
    Full customs: 1332
    Total: 3462
  Circuit library size

    element type        | avail | used
    Clock regens        | 445   | 277
    Std cell            | 547   | 435
    G datapath elements | 493   | 271
    W datapath elements | 248   | 147
    Total               | 1733  | 1130

C5J Security Components (metal 1-4 only)
  [Layout plot. Labels: clock repeaters; stk (datapath stacks, several); custom; global clk meanders; 7 RC bfrs; APR (control for all stacks); 32b data “bfr” section; decoupling caps]
  Note: global interconnects not shown

C5J Security Components (metal 1-4 only)
  [Layout plot. Labels: SHA sch & ALU; key RAM; common control logic; RNG buffers; 128b-wide AES engine]

“Fast Build & Timing”
  Every 1-5 days: full-chip “Release”
    APRs synthesized, placed & RCs estimated
    Stacks “cracked”, placed & RCs estimated
    Full-chip timing done with estimated RCs
    Takes < 1 day for full-chip timing report
  Every 5-10 days: full-chip physical build
    APRs routed
    Stacks routed
    Global chip routed
    Global chip layout produced
    APRs, stack & global route RC extraction
    RCs feed back to calibrate estimated RCs
  This goes on continuously, picking up new Releases as needed
  Our experience at other companies: much slower

Basic “Release” Process
  [Flow diagram. Recoverable labels:]
  Libraries: phy lib, pwr lib, timing lib, tech lib
  Inputs: processor.v, verilog source modules, floorplan (shape from floorplan), I/O constraints, clock tree, wire control, etc., tech file, element timing models
  Netlist (nl) steps: merge, flatten, substitute, expand, rename, split
  APR path: synthesis & place APR blks (synthesis ctl), RC estimator, APR build
  Stack path: split DP stacks, stack place, dp RC estimator, stack build
  Global path: global nl (gen auto or by hand), gen RC, build ctl, global build
  Full-Chip Timing
  1-5 day cycle (release & timing); 5-10 day cycle (physical build)

RTL Design Rules
  APR Blocks
    Element instantiation OK
      Registers (req’d: synthesis can’t infer them correctly)
      Clock buffers & distribution (req’d: synthesis clocks are slow!!)
      Occasional logic (this has diminished over time)
      The instantiated elements are really macros
        Auto expanded to right size, number of bits, etc. in the flow
    Wires & continuous assignment OK
      Including operators like ?, +, <, etc.
    Nothing else! (no procedural stuff)
      No if/else, no case, no loops, no “always”, no “at”, etc.
      No timing information/control
      Synthesis generates bad logic for these
        Unexpected/superfluous elements, registers where not expected, timing doesn’t work, etc.
  Stacks
    Component instantiation & wires only!

APR RTL Example (As Written)

  assign idleNS = (T[0] | T[8]) | shaDone_P;
  assign funcNS = (T[1] | T[3] | T[6] | T[10]) & ~shaDone_P;
  assign add1NS = (T[2]) & ~shaDone_P;
  assign add2NS = (T[5]) & ~shaDone_P;
  assign faddNS = (T[4] | T[7] | T[9]) & ~shaDone_P;

  rregs #(5) state (.q   ({idleState, funcState, add1State, add2State, faddState}),
                    .d   ({idleNS, funcNS, add1NS, add2NS, faddNS}),
                    .clk (ph1c));
  ------------------
  sha2cnst sha2cnst (.in   (iteration[5:0]),
                     .ksel (shKSel),
                     .algo (sha1_P),
                     .out  (KsubI));
  ------------------
  wire [6:0] nextIteration;
  assign nextIteration = (shaDone_P | idleState) ? 7'b0000000 :
                         shIterationStall        ? iteration  :
                                                   iteration + 1;

Stack RTL Example

  Datapath Section

  /*------------------- KeyGen XOR --------------------------*/
  wire [31:0] aesKeyGenXorOut2_L;
  zdxor #(32,15) keyg1 (.out (aesKeyGenXorOut2_L),
                        .in0 (aesWord2I_LB),
                        .in1 (aesKeyGenXorOut1_LB));
  zinv #(32,60) kgen2 (aesKeyGenXorOut2_LB, aesKeyGenXorOut2_L);

  wire [31:0] aesKeyGenXorOut2_MB;
  wire [31:0] aesKeyGenXorOut2_M;
  zregi_en #(32,10) keyg2 (.q   (aesKeyGenXorOut2_MB),
                           .d   (aesKeyGenXorOut2_L),
                           .clk (EPH1),
                           .en  (aesDynEn_K));
  zinv #(32,10) keyg2i (aesKeyGenXorOut2_M, aesKeyGenXorOut2_MB);

  Buffer Section

  rregsi #(2,20) bf_kk (.qb  (aesKeyMuxSel_M),
                        .d   (aesKeyMuxSel_LB),
                        .clk (evph1));

Stack Placement Tool Output (32-bit AES stack)
  [Layout plot. Labels: buffer section added; inter-element routing (m2-6); global wires added]

Sample Timing Report

  “Path” time | path (^ rising, v falling) | element | delta | load cap | wire | rise/fall
  0.875ns | eeph1aesdp2 ^ | aesdp2/eph1buf_aesdp2/ | 0.050ns | 0.2423pF | 0.000ns | 0.000ns
  0.925ns | aesdp2/eph1 ^ | aesdp2/sc_c0ph1_48/ | 0.160ns | 0.0321pF | 0.000ns | 0.000ns
  1.085ns | aesdp2/keyg2_ph1 ^ | aesdp2/gxregi_x4_10 | 0.063ns | 0.0035pF | 0.000ns | 0.004ns
  1.148ns | aesdp2/aesdp2_dp_aeskeygenxorout2_mb10 v | | 0.000ns | 0.0035pF | 0.000ns | 0.004ns
  1.148ns | aesdp2/aesdp2_dp_keyg2i_stack_bit10_i0 v | aesdp2/ginv_10 | 0.026ns | 0.0209pF | 0.000ns | 0.044ns
  1.173ns | aesdp2/aesdp2_dp_aeskeygenxorout2_m10 ^ | | 0.000ns | 0.0209pF | 0.000ns | 0.045ns
  1.174ns | aesdp2/aesdp2_dp_invk_stack_bit10_i0 ^ | aesdp2/gemux3i_19 | 0.045ns | 0.0336pF | 0.000ns | 0.031ns
  1.219ns | aesdp2/aesdp2_dp_key_mb10 v | | 0.000ns | 0.0336pF | 0.000ns | 0.031ns
  1.219ns | aesdp2/aesdp2_dp_kml_stack_bit10_i0 v | aesdp2/ginv_31 | 0.017ns | 0.0188pF | 0.000ns | 0.013ns
  1.236ns | aesdp2/aesdp2_dp_key_m10 ^ | | 0.001ns | 0.0188pF | 0.001ns | 0.014ns
  1.236ns | aesdp2/aesdp2_dp_mixcoldec_xorout_stack_bit10_in0 ^ | aesdp2/gxor8_10 | 0.095ns | 0.0170pF | 0.000ns | 0.029ns
  1.331ns | aesdp2/aesdp2_dp_decout_m10 v | | 0.000ns | 0.0170pF | 0.000ns | 0.030ns
  1.332ns | aesdp2/aesdp2_dp_mcmux_stack_bit10_i2 v | aesdp2/gmux3i_10 | 0.030ns | 0.0089pF | 0.000ns | 0.017ns
  1.362ns | aesdp2/aesdp2_dp_mcout_mb10 ^ | | 0.000ns | 0.0089pF | 0.000ns | 0.017ns
  1.362ns | aesdp2/aesdp2_dp_invm_stack_bit10_i0 ^ | aesdp2/ginv_31 | 0.030ns | 0.1101pF | 0.000ns | 0.053ns
  1.391ns | aesdp2/aesdp2_dp_mcout_m10 v | | 0.012ns | 0.1101pF | 0.012ns | 0.078ns
  1.403ns | aesdp2/aesdp2_dp_pipemux0_stack_bit10_i1 v | aesdp2/gmux2i_16 | 0.048ns | 0.0249pF | 0.000ns | 0.030ns
  1.451ns | aesdp2/aesdp2_dp_aesword2i_kb10 ^ | | 0.001ns | 0.0249pF | 0.001ns | 0.032ns
  1.452ns | aesdp2/aesdp2_dp_byte1_indx_pb2 ^ | | | | |

  Local reg clock-to-next reg input = 1.452 - 1.085 = 367 ps

Random Circuit Topics
  Clocking is very difficult & very critical
    Very aggressive skew goals
      “0” ps clock skew across all top-level blocks
      <20 ps skew worst case within a block
      These are met in our designs ignoring on-chip silicon variations
    Multiple clock domains required (for bus & various power states)
    Many “early”, “late”, etc. versions of the clocks needed
    Clocks must be gated (for power management)
  Our clocking methodology is proprietary, but…
    Hand-routed global clock tree (continually changing)
    Our own tools to generate clock shields tuned to surroundings
    Tunable “repeaters” (via fuse & via metal)
    Hand instantiated clock elements within blocks
      Many selectable clocks (xx ps for each reg)
    Auto-generated clock grids within APRs & stacks
    Fuse adjustable PLL characteristics (duty cycle, etc.)
  Power/ground distribution critical
    Extensive analysis & “management” required

Random Circuit Topics (cont)
  Robust circuit design req’d across 12 “corner” models
    54 formal corners identified, we choose the most critical “12”
    Covers variations in: Temp, V, N xistor, P xistor
    Automated element simulation done across these models
    Full-chip timing is done using 2 of these corners (hi V, lo V)
  Extensive use of dynamic logic
    Precharge in phase 1, evaluate in phase 2
    Registers, adders, comparators, arrays, etc.
    Customs, stacks (& APRs)
  Two stack-element libraries
    With different bit pitches
  Element libraries have several versions of the same function
    Usually, at least “fast/big/hot” & “slow/small/cool”
    Example: C5J has 2 different “vanilla” 32-bit adders
      Fast (dynamic): 180 ps, 37.9 high
      Slow (static): 250 ps, 16.9 high
    Note: 25 total adders in library, instantiated 65 total times

Random Circuit Topics (cont)
  Several families of registers available
    Differ in function, speed, size & performance
    Std cell, datapath & custom versions
    Each comes in many drive strengths (sizes)
  Many have built-in functions
    muxes, and/or logic, xors, compares, etc.
    These provide speed/size/power improvements vs. separate elements
  Examples using C5J stack elements
    [Figure. Recoverable labels; size labels carry no units in the extraction:]
    26b normal reg (x-reg 10), size 3.8: 82 ps (data-to-out)
    26b fast reg (k-reg 10), size 5.0: 32 ps
    Separate elements, 26b in, 1b out: static cmp-eq 20 (size 4.6) + k-reg 10 (size 3.8) + inv 54 (size 1.4): 90 + 32 + 17 = 139 ps
    Combined element, 26b in, 1b out: k-reg + dynamic cmp-eq 60 (size 9.5): 88 ps

C5J Security Component Sizes (mm2)
  [Layout plot with per-block areas; labels include “Sample scale” and 227]
  Block areas: 0.014, 0.014, 0.080, 0.080, 0.091, 0.069, 0.046, 0.021, 0.080, 0.034
  Total = 0.529 mm2 + 0.014 for 2 RNG’s (elsewhere) = 0.54
  (a few cents, but for this chip it’s really free)

C5J Security Component Sizes
  (If we had only known about all this space when we started…)
  [Layout plot with per-block areas in mm2: 0.014, 0.014, 0.080, 0.080, 0.091, 0.069, 0.046, 0.080, 0.034; scale label 227]
  Note: We had so much spare room on die that we didn’t spend any effort making this smaller. We estimate at least 30% smaller if we tried hard!

  [Annotated AES datapath stack, top to bottom:]
  Startup, CBC, etc. muxes & registers
  ---register---
  S-box ROM (2 x 256 x 8 bit) x 4 bytes, 200 ps access (dynamic)
  Row-shift muxes (wires to other 32b stacks not visible)
  ---register---
  Column multiply (& key xor) made out of 2-, 3-, 4-, 5-, 6-, 7- & 8-input xors
  ---register---
  Startup, CBC, etc. muxes & registers
  (extra stuff at bottom for key generation)
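
The note that the column multiply is built from multi-input XORs reflects a general property of AES: MixColumns is linear over GF(2), so every output bit is an XOR of input bits. The Python sketch below shows the standard FIPS-197 encrypt-direction MixColumns for one 4-byte column. It is only an illustration of that property, not a description of the Centaur circuit, and the function names (xtime, mix_column) are hypothetical.

```python
def xtime(b: int) -> int:
    """Multiply a byte by 2 in GF(2^8) using the AES polynomial 0x11B."""
    b <<= 1
    # If bit 8 was shifted out, reduce by XORing in 0x1B and dropping bit 8.
    return (b ^ 0x1B) & 0xFF if b & 0x100 else b


def mix_column(col):
    """Apply AES MixColumns (coefficients 2, 3, 1, 1) to one 4-byte column."""
    a0, a1, a2, a3 = col
    # Multiplying by 3 is xtime(a) ^ a, so each row is shifts and XORs only.
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def xor_columns(x, y):
    """Bytewise XOR of two columns (also how a round key word is added)."""
    return [p ^ q for p, q in zip(x, y)]


column = [0xDB, 0x13, 0x53, 0x45]
print([hex(v) for v in mix_column(column)])
```

Because the transform is linear, mixing the XOR of two columns gives the same result as XORing the two mixed columns. Flattening the conditional reduction and the byte XORs to individual bits is what yields a network of XOR gates with varying fan-in, and the round-key XOR can be folded into the same network.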