Origin System Architecture: Hardware and Software Environment

Scalar Architecture
[Diagram: processor with register file and functional units (mult, add), cache (~2 GB/s, ~10 cycles), and memory (~500 MB/s, ~100 cycles).]
Reduced Instruction Set Computer (RISC) architecture:
• load/store instructions refer to memory
• functional units operate on items in the register file
• memory hierarchy in the scalar architecture
  – most recently used items are captured in the cache
  – access to the cache is much faster than access to memory

Vector Architecture
Vector operation:
  DO i=1,n
    DO k=1,n
      C(i,1:n) = C(i,1:n) + A(i,k)*B(k,1:n)
    ENDDO
  ENDDO
[Diagram: processor with vector registers and functional units (mult, add); C(i,1:n) is accumulated in a vector register while A and B are streamed from memory.]
  loadf  f2,(r3)       load scalar A(i,k)
  loadv  v3,(r3)       load vector B(k,1:n)
  mpyvs  v3,v3,v2      calculate A(i,k)*B(k,1:n)
  addvv  v4,v4,v3      update C(i,1:n)
• vectors are loaded from memory with the loadv instruction
• performance is determined by memory bandwidth
• optimization takes the vector length (64 words) into account

Multiprocessor Architecture
[Diagram: two processors, each with register file, functional units (mult, add), cache, and cache coherency unit, sharing one memory. The cache coherency unit intervenes if two or more processors attempt to update the same cache line.]
• all memory (and I/O) is shared by all processors
• read/write conflicts between processors on the same memory location are resolved by the cache coherency unit
• the programming model is an extension of the single-processor programming model

Multicomputer Architecture
[Diagram: two processors, each with register file, functional units (mult, add), cache, and its own main memory, connected by an interconnect.]
• all memory and I/O paths are independent
• data movement across the interconnect is "slow"
• the programming model is based on message passing
  – processors explicitly engage in communication by sending and receiving data (a minimal sketch follows below)
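To make the message-passing model concrete, here is a minimal sketch in C using MPI. MPI is not mentioned on these slides; it is used only as a common example of explicit send/receive communication, and the buffer size and message tag are arbitrary choices.

  /* Illustrative message-passing sketch (not from the slides): one process
   * sends a buffer, another receives it explicitly. Compile with an MPI
   * wrapper such as mpicc, whose availability depends on the system. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank;
      double buf[100] = {0};
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          buf[0] = 3.14;
          MPI_Send(buf, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* explicit send */
      } else if (rank == 1) {
          MPI_Recv(buf, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);  /* explicit receive */
          printf("rank 1 received %f\n", buf[0]);
      }

      MPI_Finalize();
      return 0;
  }

Every process runs the same program; only the process with the matching rank executes the send or the receive, which is exactly the explicit data exchange the multicomputer model requires.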
Origin 2000 Node Board: Basic Building Block
[Diagram: node board with two R1*K processors and their caches, main memory with directory (a second directory for >32P systems), and a Hub connecting the CPUs, memory, XIO, and CrayLink.]
• 2 x R12000 processors
• 64 MB to 4 GB main memory
Hub bandwidth peaks:
• CPUs: 780 MB/s [625]
• memory: 780 MB/s [683]
• XIO link: 1.56 GB/s [1.25]
• CrayLink: 1.56 GB/s [1.25]

O2000 Node Board
[Diagram: two R1x000 processors with 1/4/8 MB L2 caches, directory SDRAM, main memory (up to 4 GB/node; SDRAM, 144 bits @ 50 MHz = 800 MB/s), and the HUB with processor, memory, I/O, and CrayLink interfaces; CrayLink duplex connection (2x23 bits @ 400 MHz, 2x800 MB/s) to other nodes; input/output on every node at 2x800 MB/s.]
HUB crossbar ASIC (950K gates, 100 MHz, 64-bit BTE, 64 counters per 4 KB page):
• a single chip integrates all four interfaces:
  – processor interface; two R1x000 processors multiplex on the same bus
  – memory interface, integrating the memory controller and the (directory) cache coherency
  – interface to the CrayLink interconnect to other nodes in the system
  – interface to the I/O devices with XIO-to-PCI bridges
• memory access characteristics:
  – read bandwidth, single processor: 460 MB/s sustained
  – average access latency: 315 ns to restart the processor pipeline

Origin 2000 Switch Technology
[Diagram: node board (two processors with caches, main memory with directory, Hub) connected through an XBOW crossbar with 6 ports to XIO, and through routers (R) to other node boards (N); nodes and routers form a ccNUMA hypercube.]

O2000 Scalability Principle
[Diagram: two node boards, each with two R1x000 processors (1/4/8 MB L2 caches), directory SDRAM, main memory, and a HUB with processor, memory, I/O, and link interfaces, joined by a crossbar router network.]
The distributed switch does scale:
– a network of crossbars allows for full remote bandwidth
– the switch components are distributed and modular

Origin 2000 Module: System Building Block
Module features:
• up to 8 R12000 CPUs (1-4 nodes)
• up to 16 GB physical memory
• up to 12 XIO slots
• 2 XBOW switches
• 2 router switches
• 64-bit internal PCI bus (optional)
• up to 2.5 [3.1] GB/s system bandwidth
• up to 5.0 [6.2] GB/s I/O bandwidth

Origin 2000 Deskside System (SGI 2100 / 2200)
• 2-8 CPUs
• 16 GB memory
• 12 XIO slots

Origin 2000 Single-Rack System (SGI 2400)
• 2-16 CPUs
• 32 GB memory
• 24 XIO slots

Origin 2000 Multi-Rack System
• 17-32 CPUs
• 64 GB memory
• 48 XIO slots
• 32-processor hypercube building block

Origin 2000 Large Multi-Rack Systems (SGI 2800)
• up to 512 CPUs
• up to 1 TB memory
• 384+ XIO slots

Scalable Node Product Concept
Address diverse customer requirements:
• independent scaling of CPU, I/O, and storage; tailor ratios to suit the application
• large dynamic range of product configurations
• RAS via component isolation
• independent evolution and upgrade of system components
• maximum leverage of engineering and technology development efforts
Modular architecture with interface and form-factor standards (I/O subsystems).

Origin 3000 Hardware Modules (Bricks)
• C-brick: CPU module
• R-brick: router interconnect
• I-brick: base I/O module
• P-brick: PCI expansion
• X-brick: XIO expansion
• D-brick: disk storage
• G-brick: graphics expansion

Origin 3000 MIPS Node (max 128 nodes / 512 CPUs per system)
[Diagram: four R1*000 processors with L2 caches around a Bedrock ASIC with memory/directory, NUMAlink3, and XIO+ ports.]
• memory interface: 4x O2K bandwidth (200 MHz, 3200 MB/s), 60% of O2K latency (180 ns local), up to 8 GB DDR SDRAM per node
• two independent SysAD interfaces, each 2x O2K bandwidth (200 MHz, 1600 MB/s each)
• NUMAlink3 network port: 2x O2K bandwidth (800 MHz, 1600 MB/s bidirectional)
• XIO+ port: 1.5x O2K bandwidth (600 MHz, 1200 MB/s bidirectional)

Origin 3000 CPU Brick (C-brick)
• 3U high x 28" deep
• four MIPS or IA-64 CPUs
• 1-4 DIMM pairs: 256 MB, 512 MB, 1024 MB (premium)
• 48 V DC power input
• N+1 redundant, hot-plug cooling
• independent power on/off
• each CPU brick can support one I/O brick

Origin 3000 Bedrock Chip
[Diagram only.]
SGI Origin 3000 Bandwidth: Theoretical vs. Measured (MB/s)
[Diagram: per-node bandwidths; theoretical values of 1600 MB/s per CPU interface, 3200 MB/s to memory, and 2x1600 MB/s between hubs, against measured values of roughly 900-1150 MB/s per CPU, 2100 MB/s to memory, and 2x1250 MB/s between hubs.]

STREAMS Copy Benchmark (MB/s)

  System                        1 CPU    2 CPUs   4 CPUs   8 CPUs
  Origin 2000, R12KS 400 MHz    380.0    381.0    820.0    1538.0
  Origin 3000, R12KS 400 MHz    623.0    777.0    1406.0   2855.0
  Origin 3000, R14K 500 MHz     685.0    778.0    1401.0   2823.0
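The copy test in the table above measures how fast the machine can stream one array into another. Below is a minimal single-threaded sketch of such a copy kernel; it is only an illustration, not the official STREAM benchmark code (that is J. McCalpin's STREAM): the array size and the use of clock() are arbitrary choices here, and the multi-CPU columns in the table require a parallel version.

  /* Minimal sketch of a STREAM-style copy kernel (illustration only). */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N (20 * 1000 * 1000)     /* large enough not to fit in any cache */

  int main(void)
  {
      double *a = malloc(N * sizeof *a);
      double *c = malloc(N * sizeof *c);
      long i;
      clock_t t0, t1;
      double seconds, mbytes;

      for (i = 0; i < N; i++) a[i] = 1.0;     /* touch the pages once */

      t0 = clock();
      for (i = 0; i < N; i++)                 /* the copy kernel: c(i) = a(i) */
          c[i] = a[i];
      t1 = clock();

      seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
      mbytes  = 2.0 * N * sizeof(double) / 1.0e6;   /* one read + one write */
      printf("copy: %.1f MB/s (c[0]=%.0f)\n", mbytes / seconds, c[0]);
      free(a); free(c);
      return 0;
  }

On a ccNUMA machine the result also depends on where the arrays are placed relative to the CPU, which is taken up in the data-placement discussion later in this section.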
Origin 3000 Router Brick (r/R-brick)
• 2U high x 25" deep
• replaces the system mid-plane
• multiple implementations
  – r-brick: 6-port (up to 32 CPUs)
  – R-brick: 8-port (up to 128 CPUs)
  – metarouter: 128 to 512 CPUs
• 8 NUMAlink 3 network ports, each port 3.2 GB/s (2x O2K bandwidth)
• 48 V DC power input
• N+1 redundant, hot-plug cooling
• independent power on/off
• 45 ns round-trip router latency (50% of the Origin 2000 router latency)

SGI Origin 3000 Measured Bandwidth
[Diagram: 5000 MB/s measured through a router, 2500 MB/s per link.]

SGI NUMA 3 Scalable Architecture (16p, 1 hop)
[Diagram: four C-bricks (four R1*000 CPUs and one Bedrock ASIC each) connected to one 8-port router; the remaining router ports connect to other routers.]

Origin 3000 I/O Bricks
I-brick: base I/O module
• base system I/O: system disk, CD-ROM, 5 PCI slots
• no need to duplicate the starting I/O infrastructure
P-brick: PCI expansion
• 12 industry-standard, 64-bit, 66 MHz slots
• supports almost all system peripherals
• all slots are hot-swap
X-brick: XIO expansion
• highest-performance I/O expansion
• supports HIPPI, GSN, VME, HDTV
• 4 XIO slots per brick
New I/O bricks (e.g. PCI-X) can be attached via the same XIO+ port.

Types of Computer Architecture (characterised by memory access)

MIMD
• Multiprocessors: single address space, shared memory
  – UMA (central memory)
    PVP (SGI/Cray T90)
    SMP (Intel SHV, Sun E10000, DEC 8400, SGI Power Challenge, IBM R60, etc.)
    COMA (KSR-1, DDM)
  – NUMA (distributed memory)
    CC-NUMA (SGI Origin2000, Origin3000, Cray T3E, HP Exemplar, Sequent NUMA-Q, Data General)
    NCC-NUMA (Cray T3D, IBM SP3)
• Multicomputers: multiple address spaces, NORMA (no-remote memory access)
  – Cluster: loosely coupled, multiple OS (IBM SP2, DEC TruCluster, Microsoft Wolfpack, "Beowulf", etc.)
  – "MPP": tightly coupled, single OS (Intel TFLOPS, TM-5)

Glossary:
MIMD: Multiple Instructions, Multiple Data
UMA: Uniform Memory Access
NUMA: Non-Uniform Memory Access
NORMA: No-Remote Memory Access
MPP: Massively Parallel Processor
PVP: Parallel Vector Processor
SMP: Symmetric Multi-Processor
COMA: Cache-Only Memory Architecture
CC-NUMA: Cache-Coherent NUMA
NCC-NUMA: Non-Cache-Coherent NUMA

Origin DSM-ccNUMA Architecture: Distributed Shared Memory
[Diagram: two nodes, each with four processors and caches, a Bedrock ASIC with directory, local main memory, and an XIO+ port, connected through NUMAlink3 and R-bricks.]

Distributed Shared Memory Architecture (DSM)
[Diagram: two processors, each with register file, functional units (mult, add), cache, cache coherency unit, and local main memory, connected by an interconnect.]
• local memory and an independent path to memory, as in the multicomputer architecture
• the memory of all nodes is organized as one logical "shared memory"
• non-uniform memory access (NUMA):
  – "local memory" access is faster than "remote memory" access
• the programming model is (almost) the same as for the shared-memory architecture
  – data distribution is available for optimization
• scalability properties are similar to the multicomputer architecture

Origin DSM-ccNUMA Architecture: Directory-Based Scalable Cache Coherence
[Diagram: the same two-node picture, with the directory held next to each node's main memory.]

Origin Cache Coherency
• a memory page is divided into data blocks of 32 words (128 bytes) each, the L2 cache line size
• each data request transfers one data block (128 bytes)
• each data block has associated presence and state information in the directory: 64 presence bits and 8 state bits per data block (cache line)
• directory states: Unowned (no copies), Shared (read-only copies), Exclusive (one read-write copy), Busy (state in transition)
• each L2 cache line contains 4 data blocks of 8 words (32 bytes) each, the L1 data cache line size
• if a node (HUB) requests a data block, the corresponding presence bit is set and the state of that cache line is recorded
• the HUB runs the cache coherency protocol, updating the state of the data block and notifying the nodes for which the presence bit is set
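The slide above gives the directory states and the per-block presence bits. The toy sketch below models only that bookkeeping, to show how a read or write request updates a directory entry and which nodes would be notified. It is an illustration in C, not the actual Bedrock/HUB protocol, and all names in it are made up.

  /* Toy directory entry modeled after the description above
   * (64 presence bits + state per 128-byte block). Illustration only. */
  #include <stdint.h>
  #include <stdio.h>

  enum dir_state { UNOWNED, SHARED, EXCLUSIVE, BUSY };

  struct dir_entry {
      uint64_t presence;      /* one bit per node holding a copy */
      enum dir_state state;   /* Unowned / Shared / Exclusive / Busy */
      int owner;              /* owning node when state == EXCLUSIVE */
  };

  /* A node asks for a read-only copy of the block. */
  static void read_request(struct dir_entry *d, int node)
  {
      if (d->state == EXCLUSIVE)
          printf("intervention: recall dirty copy from node %d\n", d->owner);
      d->presence |= (uint64_t)1 << node;   /* record the new sharer */
      d->state = SHARED;
  }

  /* A node asks for a writable (exclusive) copy of the block. */
  static void write_request(struct dir_entry *d, int node)
  {
      uint64_t others = d->presence & ~((uint64_t)1 << node);
      int n;
      for (n = 0; n < 64; n++)              /* notify nodes whose presence bit is set */
          if (others & ((uint64_t)1 << n))
              printf("invalidate copy held by node %d\n", n);
      d->presence = (uint64_t)1 << node;
      d->state = EXCLUSIVE;
      d->owner = node;
  }

  int main(void)
  {
      struct dir_entry d = { 0, UNOWNED, -1 };
      read_request(&d, 3);    /* node 3 reads: Unowned -> Shared */
      read_request(&d, 5);    /* node 5 reads: two presence bits set */
      write_request(&d, 5);   /* node 5 writes: invalidate node 3, go Exclusive */
      return 0;
  }

A real implementation also uses the Busy state for transactions in flight and actually recovers dirty data from the owner; the sketch only prints where such interventions and invalidations would be sent.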
CC-NUMA Architecture: Programming
[Diagram: matrix multiply C = A x B distributed over processors 1, 2, 3; every processor holds a block of columns of each matrix.]

  C$distribute A(*,block), B(*,block), C(*,block)
  C$omp parallel do
      DO i=1,n
        DO j=1,n
          DO k=1,n
            C(i,j) = C(i,j) + A(i,k)*B(k,j)
          ENDDO
        ENDDO
      ENDDO

• all data is shared
• an additional optimization places data close to the processor that does most of the computation on that data
• automatic (compiler) optimizations for single-processor and parallel performance
• the data access (data exchange) is implicit in the algorithm
• except for the additional data placement directives, the source is unchanged

Problems of the CC-NUMA Architecture
• the SMP programming style has to be combined with data placement techniques (directives)
• the "SMP programming cliff": remote memory latency jumps by a factor of ~3-5, which requires correct data placement
• based on a 1 GB/s SCI link with a latency of ~500 ns per hop; on a 64-128 processor O2000, ta(remote)/ta(local) ~ 3-5, hence correct data placement is needed (a first-touch placement sketch follows below)

DSM-ccNUMA Memory
[Diagram: shared-memory systems (SMP) are easy to program but hard to scale; massively parallel systems (MPP) are easy to scale but hard to program; distributed shared memory systems (ccNUMA) aim to be both easy to program and easy to scale.]
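Besides distribution directives, a common way to get the data placement the slides ask for is to rely on a first-touch page placement policy: the first processor that touches a page gets it allocated in its local memory. The C/OpenMP sketch below only illustrates that idea; it assumes a first-touch default policy and an OpenMP-capable compiler (for example MIPSpro cc with -mp), and the array sizes are arbitrary.

  /* Illustrative first-touch placement sketch (not from the slides). */
  #include <omp.h>
  #include <stdlib.h>

  #define N 4096

  int main(void)
  {
      double *a = malloc((size_t)N * N * sizeof *a);
      double *x = malloc((size_t)N * sizeof *x);
      double *y = malloc((size_t)N * sizeof *y);
      int i, j;

      /* First touch: each thread initializes the rows it will later use,
       * so those pages end up in (or near) that thread's node memory. */
      #pragma omp parallel for private(j)
      for (i = 0; i < N; i++) {
          x[i] = 1.0;
          y[i] = 0.0;
          for (j = 0; j < N; j++)
              a[i*N + j] = 1.0;
      }

      /* Compute y = A*x with the same loop partitioning as the init loop. */
      #pragma omp parallel for private(j)
      for (i = 0; i < N; i++)
          for (j = 0; j < N; j++)
              y[i] += a[i*N + j] * x[j];

      free(a); free(x); free(y);
      return 0;
  }

The point is simply that the initialization loop and the compute loop are partitioned the same way, so each thread mostly reads pages that live on its own node.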
SGI 3200 (2-8p)
• router-less configurations in a deskside form factor or short rack (17U of configuration space)
[Diagram: system topology; the minimum (2p) system is one C-brick with an I-brick on its XIO+ port and a power bay; the maximum (8p) system is two C-bricks connected directly through their network ports, each with XIO+ ports to I-, P-, or X-bricks, plus power bays.]

SGI 3400 (4-32p)
• full-size rack (39U of configuration space)
• built around the 6-port r-brick router
[Diagram: the minimum (4p) system is one C-brick, an I-brick, and power bays; the maximum (32p) system is eight C-bricks connected through two 6-port r-bricks, each C-brick with an XIO+ port to an I-, P-, or X-brick.]

SGI 3800 (16-128p)
• built around the 8-port R-brick router
[Diagram: the minimum (16p) system is four C-bricks, two R-bricks, an I-brick, and power bays; the maximum (128p) system spans four racks of 32 processors each (eight 16-processor groups), interconnected by R-bricks.]

SGI 3800 (32-512p)
[Diagram: one quadrant of a 512p system; groups of four C-bricks per R-brick, R-bricks interconnected through a metarouter, I-, P-, or X-bricks on the XIO+ ports, and power bays.]
512p power estimates: MIPS = 77 kW, Itanium = 150 kW, McKinley = 231 kW (no I/O or storage included in the estimates; premium memory required).

Router-to-Router Connections for 256-Processor Systems
[Diagram only.]

512-Processor Systems
[Diagram only.]

R1xK Family of Processors
The MIPS R1x000 is an out-of-order, dynamically scheduled superscalar processor with non-blocking caches:
• supports the 64-bit MIPS IV ISA
• 4-way superscalar
• five separate execution units
• 2 floating-point results per cycle
• 4-way deep speculative execution of branches
• out-of-order execution (48-instruction window)
• register renaming
• two-way set-associative non-blocking caches
  – up to 4 outstanding memory read requests
  – prefetching of data
  – 1 MB to 8 MB secondary data cache
• four user-accessible event counters

Origin 3000 MIPS Processor Roadmap (1999-2003)
• R10000: 250 MHz, 500 MFlops
• R12000: 300 MHz, 600 MFlops
• R12000A: 400 MHz, 800 MFlops
• R14000(A): 500+ MHz, 1000+ MFlops, 8 MB DDR SRAM L2 @ 250+ MHz
• R16000: xxx MHz, xxx GFlops
• R18000: xxx MHz, xxx GFlops
(The roadmap also lists L2 cache configurations: 8 MB @ 266 MHz for the Origin 2000, and 8 MB @ 200 MHz and 4 MB @ 250 MHz for the O3K-MIPS parts.)

R14000 Cache Interfaces / Memory Hierarchy
[Diagram: memory hierarchy with 64 registers (access once per clock), 32 KB L1 cache (~2-3 cycles), 8 MB L2 cache (~10 cycles), memory (~100-300 cycles, NUMA), and disk (~4000 cycles, capacities of ~1 to 100s of GB).]
[Chart: remote latency (ns) versus system size from 2p to 512p; Origin 3000 grows from about 175 ns to under 500 ns, Origin 2000 from about 343 ns to about 1169 ns.]

Effects of Memory Hierarchy
[Chart: performance as a function of working-set size for the 32 KB L1 cache and L2 caches of 1 MB, 2 MB, and 4 MB.]
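The effect shown in that chart can be reproduced with a tiny sweep program: time one memory access while the working set grows past the L1 and L2 sizes. The sketch below is only an illustration; the array sizes, the 128-byte stride, and the use of clock() are arbitrary choices, and a serious measurement would pin the process and repeat the runs.

  /* Illustrative cache-size sweep (not from the slides). */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  int main(void)
  {
      const size_t stride = 16;                  /* in doubles: 128-byte steps */
      const long repeats = 20 * 1000 * 1000;
      size_t size;

      for (size = 4096; size <= 32u * 1024 * 1024; size *= 2) {
          size_t n = size / sizeof(double);
          double *a = calloc(n, sizeof *a);
          volatile double sink = 0.0;
          size_t i = 0;
          long r;
          clock_t t0 = clock(), t1;

          for (r = 0; r < repeats; r++) {        /* repeatedly sweep the array */
              sink += a[i];
              i += stride;
              if (i >= n) i = 0;
          }
          t1 = clock();
          printf("%8lu bytes: %.2f ns/access\n", (unsigned long)size,
                 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / repeats);
          free(a);
      }
      return 0;
  }

One would expect the ns/access figure to step up when the working set exceeds 32 KB (L1) and again when it exceeds the L2 size.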
Instruction Latencies (R12K)

  Integer units                                latency   repeat rate
  • ALU 1
    – add, sub, logic ops, shift, br           1         1
  • ALU 2
    – add, sub, logic ops                      1         1
    – signed multiply (32/64 bit)              6/10      6/10
    – (unsigned multiply: +1 cycle)
    – divide (32/64 bit)                       35/67     35/67
  • Address unit
    – load integer                             2         1
    – load floating point                      3         1
    – store                                    -         1
    – atomic LL,ADD,SC sequence                6         6

  Floating point units                         latency   repeat rate
  • FPU 1
    – add, sub, compare, convert               2         1
  • FPU 2
    – multiply                                 2         1
    – multiply-add (madd)                      4         1
  • FPU 3
    – divide, reciprocal (32/64 bit)           12/19     14/21
    – sqrt (32/64 bit)                         18/33     20/35
    – rsqrt (32/64 bit)                        30/52     34/56

A repeat rate of 1 means that, after pipelining, the processor can complete 1 operation per cycle. The peak rates are therefore 2 integer operations/cycle and 2 fp operations/cycle; for the R14000 @ 500 MHz this gives 4 x 500 MHz = 2000 MIPS and 2 x 500 MHz = 1000 Mflop/s.
The compiler has this table built in. The goal of compiler scheduling is to find instructions that can be executed in parallel to fill all slots: ILP, instruction-level parallelism.

Instruction Latencies: DAXPY Example

  DO I=1,n
    Y(I) = Y(I) + A*X(I)
  ENDDO

Loop parallelism, per single loop iteration:
• 2 loads, 1 store
• 1 multiply-add (madd)
• 2 address increments
• 1 loop-end test
• 1 branch

Processor parallelism, per processor cycle:
• 1 load or store
• 1 ALU1 instruction
• 1 ALU2 instruction
• 1 FP add
• 1 FP multiply

– there are 2 loads (x, y) and 1 store (y) = 3 memory operations
– there are 2 fp operations (+, *), which can be done with 1 madd
• 3 memory operations require at least 3 cycles (the processor can do 1 memory operation per cycle)
• theoretically, in 3 cycles the processor can do 6 fp operations, but only 2 fp operations are available in the code
• the maximum speed is therefore 2fp/6fp = 1/3 of peak on this code, i.e. for the R12000 @ 300 MHz: 600/3 = 200 Mflop/s

DAXPY Example: Schedules

  DO I=1,n
    Y(I) = Y(I) + A*X(I)
  ENDDO

Simple schedule: one iteration in 8 cycles (ld x, ld y, x++, madd, st y, y++, br).
2 fp / (8 cycles x 2 fp/cycle) = 1/8 of peak; R12000 @ 300 MHz ~ 75 Mflop/s.

Unrolled by 2:

  DO I=1,n-1,2
    Y(I+0) = Y(I+0) + A*X(I+0)
    Y(I+1) = Y(I+1) + A*X(I+1)
  ENDDO

Schedule: two iterations in 9 cycles (ld x0, ld x1, ld y0, ld y1, x+=4, st y0, st y1, y+=4, madd0, madd1, br).
4 fp / (9 cycles x 2 fp/cycle) = 2/9 of peak; ~133 Mflop/s.

DAXPY Example: Software Pipelining
• software pipelining fills all processor slots by mixing iterations
• the number of replications gives how many iterations are mixed
• the number of replications depends on the distance (in cycles) between the load and the calculation

  #<swp> replication 0                        #cycle
  ld x0    ldc1   $f0,0($1)                   #[0]
  ld x1    ldc1   $f1,-8($1)                  #[1]
  st y2    sdc1   $f3,-8($3)                  #[2]
  st y3    sdc1   $f5,0($3)                   #[3]
  y+=2     addiu  $3,$2,16                    #[3]
           madd.d $f5,$f2,$f0,$f4             #[4]
  ld y0    ldc1   $f0,-8($2)                  #[4]
           madd.d $f3,$f0,$f1,$f4             #[5]
  x+=2     addiu  $1,$1,16                    #[5]
           beq    $2,$4,.BB21.daxpy           #[5]
  ld y3    ldc1   $f2,0($3)                   #[5]

  #<swp> replication 1                        #cycle
  ld x3    ldc1   $f1,0($1)                   #[0]
  ld x2    ldc1   $f0,-8($1)                  #[1]
  st y1    sdc1   $f3,-8($2)                  #[2]
  st y0    sdc1   $f5,0($2)                   #[3]
  y+=2     addiu  $2,$3,16                    #[3]
           madd.d $f5,$f2,$f1,$f4             #[4]
  ld y3    ldc1   $f1,-8($3)                  #[4]
           madd.d $f3,$f1,$f0,$f4             #[5]
  x+=2     addiu  $1,$1,16                    #[5]
  ld y0    ldc1   $f2,0($2)                   #[5]

• DAXPY: a 6-cycle schedule with 4 fp ops: 4 fp / (6 cycles x 2 fp/cycle) = 1/3 of peak

DAXPY SWP: Compiler Messages

  f77 -mips4 -O3 -LNO:prefetch=0 -S daxpy.f

• with the -S switch the compiler produces the file daxpy.s with assembler instructions and comments about the software pipelining schedule:

  #<swps> Pipelined loop line 6 steady state
  #<swps>   50 estimated iterations before pipelining
  #<swps>    2 unrolling before pipelining
  #<swps>    6 cycles per 2 iterations
  #<swps>    4 flops        ( 33% of peak) (madds count 2fp)
  #<swps>    2 flops        ( 16% of peak) (madds count 1fp)
  #<swps>    2 madds        ( 33% of peak)
  #<swps>    6 mem refs     (100% of peak)
  #<swps>    3 integer ops  ( 25% of peak)
  #<swps>   11 instructions ( 45% of peak)
  #<swps>    2 short trip threshold
  #<swps>    7 ireg registers used.
  #<swps>    6 fgr registers used.

• the schedule reaches the maximum of 1/3 of peak processor performance, as expected
• note: it is necessary to switch off prefetch to attain the maximal schedule

Multiple Outstanding Memory References
• the processor can support 4 outstanding memory requests
[Diagram: timeline comparing a "sequential" cache miss pattern (execute, wait for data, execute, wait for data) with a "parallel" pattern, where independent instructions execute while several misses are outstanding.]
Timing linked-list references with  while(x) x=x->p;

  outstanding references       1          2          4
  time per pointer fetch       230 ns     160 ns     110 ns
                               (480 ns)   (250 ns)   (240 ns)
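The "while(x) x=x->p;" timing above can be expanded into a small stand-alone latency probe: build a linked list whose nodes are scattered through a large array, then time dependent pointer fetches. The sketch below is an illustration only; the node layout, list length, and step size are arbitrary, and unlike the slide's measurement it keeps just one reference outstanding (each load depends on the previous one).

  /* Illustrative pointer-chase latency sketch (not from the slides). */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  struct node { struct node *p; char pad[120]; };   /* ~one 128-byte line per node */

  int main(void)
  {
      const long n = 1L << 20;                  /* 2^20 nodes, ~128 MB */
      const long fetches = 10 * 1000 * 1000;
      struct node *nodes = malloc(n * sizeof *nodes);
      struct node *x;
      long i;
      clock_t t0, t1;

      /* Link the nodes with a large, cache-unfriendly jump between them;
       * an odd step modulo a power of two gives one cycle over all nodes. */
      for (i = 0; i < n; i++)
          nodes[i].p = &nodes[(i + 127617) % n];

      x = &nodes[0];
      t0 = clock();
      for (i = 0; i < fetches; i++)
          x = x->p;                             /* each fetch depends on the previous one */
      t1 = clock();

      printf("%.1f ns per pointer fetch (end %p)\n",
             1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / fetches, (void *)x);
      free(nodes);
      return 0;
  }

To see the benefit of multiple outstanding references, as in the slide's table, one would chase two or four independent lists in the same loop so several misses can overlap.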
Origin 3000 Memory Latency

                 local     NI to NI   per router   remote latency
  Origin 2000    320 ns    165 ns     105 ns       485 ns + #hops x 105 ns
  Origin 3000    180 ns    50 ns      45 ns        230 ns + #hops x 45 ns

32-CPU O3K maximum latency: 315 ns

Remote Memory Latency: SGI 3000 Family vs. SGI 2000 Series
[Chart: worst-case round-trip remote latency (ns) versus node size (2p to 1024p) for the Origin 2000 (SN hypercube) and the Origin 3000 series.]

R1x000 Event Counters
The R1x000 processor family allows extensive performance monitoring with counters that can be triggered by 32 events:
• the R10000 has 2 event counters
• the R12000 has 4 event counters
The counters are incremented when an event happens in the processor (e.g. a cache miss); which event is counted is selected by the user. The first counter can be triggered by events 0-15, the second counter by events 16-31. The R12000 has 2 additional counters that allow monitoring of conditional events (i.e. events based on previous events). User access to the counters is through a software library or shell-level tools provided by the IRIX OS.

Origin Address Space
• physically, the memory is distributed and is not contiguous
• the node id is assigned at boot time
• the physical address space is 40 bits (1 TB max): bits 39-32 hold the node id (8 bits), bits 31-0 the node offset (32 bits, i.e. a maximum of 4 GB of memory per node)
• logically, memory is a single shared contiguous address space; the virtual address space is 44 bits (16 TB)
• the program (compiler) uses the virtual address space
• translation from the virtual to the physical address space is done by the CPU through the TLB (Translation Look-aside Buffer)
• the page size is configurable as 16 KB (default), 64 KB, 256 KB, 1 MB, 4 MB, or 16 MB (a small sysconf sketch at the end of this section shows how to query it)
[Diagram: virtual pages mapped through the TLB onto the physical address space; each node occupies a 4 GB slot (node 0 at 0-4 GB, node 1 at 4-8 GB, node 2 at 8-12 GB, ...), with empty slots where no memory is present.]

Process Scheduling
IRIX is a symmetric multiprocessing operating system:
• processes and processors are independent
• parallel programs are executed as jobs with multiple processes
• the scheduler allocates processes to processors
Priority range from 0 to 255:
• 0: weightless (batch)
• 1-40: time share, interactive (TS)
• 90-239: system (daemons and interrupts)
• 1-255: real-time processes (FIFO & RR)

System Monitoring Commands
  uptime(1)          information about system usage and user load
  w(1)               who is on the system and what are they doing?
  sysmon             system log viewer
  ps(1)              a "snapshot" of the process table
  top, gr_top        dynamic display of the process table
  osview             system usage statistics
  sar                system activity reporter
  gr_osview          system usage statistics in graphical form
  gmemusage          graphical memory usage monitor
  sysconf            system limits, options, and parameters

  ecstats -C         R10K counter monitor
  ja                 job accounting statistics
  oview              Performance Co-Pilot (bundled with IRIX)
  pmchart            Performance Co-Pilot (licensed software)
  nstats, linkstat   CrayLink connection statistics (man refcnt(5))
  bufview            system buffer statistics
  par                process activity report
  numa_view, dlook   process memory placement information
  limit [-h]         displays system soft [hard] limits

  hinv               hardware inventory
  topology           system interconnect description

Summary: Origin Properties
• single machine image: it behaves like a fat workstation
  – same compilers
  – time sharing
  – all your old code will run
  – the OS schedules all the hardware resources on the machine
• processor scalability: 2-512 CPUs
• I/O scalability: 2-300 GB/s
• all memory and I/O devices are directly addressable
  – no limitation on the size of a single program; it can use all the available memory
  – no limitation on the location of the data; all disks can be used in a single file system
• 64-bit operating system and file system
  – HPC features: Checkpoint/Restart, DMF, NQE/LSF, TMF, Miser, job limits, cpusets, enhanced accounting
• machine stability
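As a small appendix to the Origin Address Space discussion above: the base page size the kernel is using can be queried from a program with the POSIX sysconf(3) library call (not to be confused with the IRIX sysconf shell tool listed among the monitoring commands). A minimal sketch:

  /* Query the base page size via POSIX sysconf(3). Illustration only;
   * other sysconf names (e.g. for the processor count) vary by system. */
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      long pagesize = sysconf(_SC_PAGESIZE);   /* base page size in bytes */
      printf("page size: %ld bytes\n", pagesize);
      return 0;
  }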