The CRAY-1 Computer System
Richard M. Russell
Presented by Andrew Waterman
ECE259 Spring 2008

Background
• CRAY-1 by no means first vector machine
  – 1960s: Westinghouse Solomon/ILLIAC IV
  – 1974: CDC STAR 100
• “I never, ever want to be a pioneer” --Cray
  – STAR 100, ILLIAC IV: who's this Amdahl dude?
• 1972: Cray Research formed after spat with CDC
  – Seymour Cray wanted to start from scratch on 8600; CDC brass, not so much
• 1976: first CRAY-1 deployed at Livermore

CRAY-1 Hardware

Look Ma, No ASICs!

CRAY-1 Architecture
• 5-ton, vector uniprocessor
• Word size = 64 bits
• 80 MHz clock
• 8MB RAM in 16 banks @ 20 MHz
  – f_cpu/f_mem = 4 (!!)
• Fairly RISCy 16- or 32-bit instructions
  – Load/store; register-register operations

Scalar Operation and Octal Annoyance
• 10₈ (8) A-registers for 24-bit address calculations
• 100₈ (64) B-registers serve as backing store for A-registers
• 10₈ (8) S-registers for source/dest of scalar integer/FP insns
• T is to S as B is to A
• 11₈ (9) pipelined scalar FUs
  – Address add, mult
  – Integer add, shift, logic, pop count
  – FP add, mult, reciprocal

Scalar Operation
• Protection without virtual memory
  – Base & limit address regs
    • Ld $dest,$addr actually loads from $base+$addr
    • Program killed if $base+$addr >= $limit
• A handful of registers for interrupts, exceptions, etc.

OS and Front End
• COS (CRAY OS) handles job scheduling, storage management (tapes!), other I/O, checkpointing
  – Packaged with CAL (assembler)
  – ...and CFT (Fortran compiler), more later
• Command-line interface and job submission via separate front-end computer, e.g. VAX

Vector Operation (Finally!)
• 8x64-word V-registers
• Vector Length Register
  – Indicates # ops performed by vector insns
  – Set from contents of an A-register
• Vector Mask Register
  – Indicates which elements in vector to operate on
  – Set by vector test insns (e.g.
VM[i] := ($Vk[i] == 0))
• 6 Vector FUs
  – integer add, shift, bitwise logic
  – FP via scalar FPU: add, mult, reciprocal

Vector Load/Store Architecture
• Big departure from STAR 100: register-register ops
• CRAY-1 memory bandwidth == 80 Mword/s == 1 word/cycle
  – If all 2-source insns are memory-memory, then IPC=1/3! (and that assumes no bank conflicts!)
  – Solution: the RISC approach
• Combined with chaining (next), can sustain >> 1 flop/cycle

Chaining
• Pipeline bypass meets vectors
• Consider SAXPY vector expression a*X+Y
  – Slow approach: compute a*X (64 mults), then compute a*X+Y (64 adds)
    • Total latency: 128+mult latency+add latency
      – since, in CRAY-1, all FUs are pipelined
  – But... no fundamental serialization requirement
    • As soon as a*X[0] is computed, can compute a*X[0]+Y[0]
    • Total latency: 64+mult latency+add latency (speedup of almost 2)

Chaining Example
• Assume: 8-element vectors, single-cycle ops
    mul.ds $v2,$v3,$s1
    add.d $v1,$v2,$v1
• Without chaining:
    mmmmmmmm
            aaaaaaaa
• With chaining:
    mmmmmmmm
     aaaaaaaa

Vector Startup Times
• For vector ops to be efficient enough to justify, startup overhead must be small
• CRAY-1 can issue a vector insn every cycle, assuming no structural hazards on FUs
  – Result: vector performance > scalar performance for as few as four elements/vector

Cray Fortran Compiler (CFT)
• Important insight: hand-coding assembly sucks
• The actual important insight: most vectorizable code is of the embarrassingly-parallel variety
  – Even with 1970s compiler technology, innermost-loop parallelism is low-hanging fruit
  – Exploit this: make the compiler do the heavy lifting
• CFT is pretty good for branchless inner loops
• ...but doesn't even attempt to vectorize code with IFs
  – So any use of the Vector Mask register must be hand-coded
• Upshot: a good start, but not quite there

Analysis
• Extremely fast computer for 1976
• Thought experiment: what if CRAY-1's parameters scaled with Moore's Law?
(32 years == 21 doublings)
  – 200,000 transistors => 400 billion transistors
  – 8MB main memory => 16TB main memory
  – 80 MHz clock => ~170 THz, roughly 0.17 petahertz (if only)
• For a (merely) 2nd-generation vector processor, the CRAY-1 was ahead of its time (I think)
  – I'm not the only one: it was commercially phenomenal
• However, design techniques (discrete logic) are totally unscalable

Questions?
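
Sketch: base & limit protection. The Scalar Operation slide says a load from $addr really reads $base+$addr, and the program is killed if $base+$addr >= $limit. A minimal Python model of that check might look like the following. The names ProtectionFault, relocate, and load are hypothetical, memory is modeled as a plain list of words, and an exception stands in for killing the program.

```python
class ProtectionFault(Exception):
    """Raised when a relocated address falls outside the program's region."""


def relocate(addr, base, limit):
    """Return base + addr, or raise if the result is at or past limit."""
    physical = base + addr
    if physical >= limit:
        # The slides say the program is killed; an exception models that here.
        raise ProtectionFault(f"address {physical} is outside limit {limit}")
    return physical


def load(memory, addr, base, limit):
    """Model 'Ld $dest,$addr': read the word at the relocated address."""
    return memory[relocate(addr, base, limit)]


# Hypothetical job occupying words 40..59 of a 100-word memory.
memory = list(range(100))
print(load(memory, 5, base=40, limit=60))  # reads the word at physical address 45
```

Because every access goes through the same add-and-compare, each job can be written as if it started at address 0, with no page tables involved.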
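
Sketch: chaining latency. The Chaining slides give total latency as 128+mult latency+add latency without chaining and 64+mult latency+add latency with it, and the example slide draws the 8-element, single-cycle case. The Python below reproduces both under those same simplifications. The function names are hypothetical, and the latencies of 7 and 6 cycles passed in at the end are illustrative inputs, not figures from the slides.

```python
def total_latency(n, mult_latency, add_latency, chained):
    """Slide formula: element cycles plus both functional-unit latencies."""
    # Unchained, the n adds wait for all n multiplies; chained, they overlap.
    element_cycles = n if chained else 2 * n
    return element_cycles + mult_latency + add_latency


def timeline(n, chained):
    """Two-row cycle diagram for single-cycle ops (m = multiply, a = add)."""
    # Chained adds start one cycle after the first multiply; unchained adds
    # start only after the last multiply has finished.
    add_start = 1 if chained else n
    return "m" * n, " " * add_start + "a" * n


for chained in (False, True):
    mul_row, add_row = timeline(8, chained)
    print("With chaining:" if chained else "Without chaining:")
    print(mul_row)
    print(add_row)

# Illustrative latencies (assumed): 7-cycle multiply, 6-cycle add, 64 elements.
slow = total_latency(64, 7, 6, chained=False)
fast = total_latency(64, 7, 6, chained=True)
print(slow, fast, round(slow / fast, 2))
```

With those assumed latencies the ratio comes out a little under 2, matching the "speedup of almost 2" on the slide.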
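
Sketch: the Moore's Law thought experiment. The Analysis slide scales the CRAY-1's parameters by 21 doublings. The arithmetic can be checked with a few lines of Python. The variable names are hypothetical, and 8MB is taken as 8 * 2**20 bytes, which is an assumption about the slide's units.

```python
DOUBLINGS = 21            # the slide's figure for 32 years
FACTOR = 2 ** DOUBLINGS   # 2,097,152

scaled_transistors = 200_000 * FACTOR
scaled_memory_bytes = 8 * 2**20 * FACTOR   # assumes 8MB means 8 * 2**20 bytes
scaled_clock_hz = 80_000_000 * FACTOR

print(f"{scaled_transistors / 1e9:.0f} billion transistors")
print(f"{scaled_memory_bytes / 2**40:.0f} TB main memory")
print(f"{scaled_clock_hz / 1e12:.0f} THz clock")
```

The transistor and memory figures land on the slide's 400 billion and 16TB; the clock lands in the hundreds of terahertz, somewhat short of a full petahertz.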