Lecture 12, Slide 1: Computer Architecture: Vector Computers

Lecture 12, Slide 2: Contents
1. Why Vector Processors?
2. Basic Vector Architecture
3. How Vector Processors Work
4. Vector Length and Stride
5. Effectiveness of Compiler Vectorization
6. Enhancing Vector Performance
7. Performance of Vector Processors

Lecture 12, Slide 3: Vector Processors
"I'm certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI (ASC) processor. Those three were all pioneering processors. ... One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It's always best to come second when you can look at the mistakes the pioneers made."
    Seymour Cray, public lecture at Lawrence Livermore Laboratories on the introduction of the Cray-1 (1976)

Lecture 12, Slide 4: Supercomputers
Definitions of a supercomputer:
• Fastest machine in the world at a given task
• A device to turn a compute-bound problem into an I/O-bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray
The CDC 6600 (Cray, 1964) is regarded as the first supercomputer.

Lecture 12, Slide 5: Supercomputer Applications
Typical application areas:
• Military research (nuclear weapons, cryptography)
• Scientific research
• Weather forecasting
• Oil exploration
• Industrial design (car crash simulation)
All involve huge computations on large data sets. In the 70s-80s, supercomputer meant vector machine.

Lecture 12, Slide 6: 1. Why Vector Processors?
• A single vector instruction specifies a great deal of work: it is equivalent to executing an entire loop.
• The computation of each result in the vector is independent of the computation of other results in the same vector, so the hardware does not have to check for data hazards within a vector instruction.
• Hardware need only check for data hazards between two vector instructions once per vector operand, not once for every element within the vectors.
• Vector instructions that access memory have a known access pattern.
• Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control hazards that would normally arise from the loop branch are nonexistent.

Lecture 12, Slide 7: 2. Basic Vector Architecture
• There are two primary types of architectures for vector processors: vector-register processors and memory-memory vector processors.
  – In a vector-register processor, all vector operations except load and store are among the vector registers.
  – In a memory-memory vector processor, all vector operations are memory to memory.

Lecture 12, Slide 8: Vector Memory-Memory versus Vector Register Machines
• Vector memory-memory instructions hold all vector operands in main memory.
• The first vector machines, the CDC Star-100 ('73) and the TI ASC ('71), were memory-memory machines.
• The Cray-1 ('76) was the first vector register machine.

Vector memory-memory code example. Source code:
    for (i=0; i<N; i++) {
      C[i] = A[i] + B[i];
      D[i] = A[i] - B[i];
    }
Vector memory-memory code:
    ADDV C, A, B
    SUBV D, A, B
Vector register code:
    LV   V1, A
    LV   V2, B
    ADDV V3, V1, V2
    SV   V3, C
    SUBV V4, V1, V2
    SV   V4, D
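To see the bandwidth difference concretely, here is a small illustrative C program (my addition, not from the slides) that simply counts main-memory word accesses per element for the two sequences above: the memory-memory form rereads A and B for the subtract and writes every result to memory, while the register form loads each input once and reuses it from registers.

    #include <stdio.h>

    /* Counts word-sized main-memory accesses per element for the
       C[i]=A[i]+B[i]; D[i]=A[i]-B[i]; example above. Illustrative only. */
    int main(void) {
        /* Memory-memory: each vector op reads both sources from memory
           and writes its destination to memory:
           ADDV (2 reads + 1 write) and SUBV (2 reads + 1 write). */
        int mm_accesses = (2 + 1) + (2 + 1);            /* 6 per element */

        /* Vector-register: A and B are loaded once into registers and
           reused for both operations; only C and D are stored. */
        int vr_accesses = 2 /* LV A, LV B */ + 2 /* SV C, SV D */;

        printf("memory-memory:   %d accesses/element\n", mm_accesses);
        printf("vector-register: %d accesses/element\n", vr_accesses);
        return 0;
    }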
Lecture 12, Slide 9: Vector Memory-Memory vs. Vector Register Machines
• Vector memory-memory architectures (VMMAs) require greater main memory bandwidth. Why?
  – All operands must be read from and written back to memory.
• VMMAs make it difficult to overlap execution of multiple vector operations. Why?
  – Dependences must be checked on memory addresses.
• VMMAs incur greater startup latency.
  – Scalar code was faster on the CDC Star-100 for vectors of fewer than 100 elements.
  – For the Cray-1, the vector/scalar breakeven point was around 2 elements.
Apart from the CDC follow-ons (Cyber-205, ETA-10), all major vector machines since the Cray-1 have had vector register architectures (we ignore vector memory-memory from now on).

Lecture 12, Slide 10: The Basic Structure of a Vector-Register Architecture: VMIPS
[Figure: block diagram of the VMIPS vector-register architecture: main memory feeding a vector load-store unit, eight vector registers, scalar registers, and a set of pipelined vector functional units.]

Lecture 12, Slide 11: Primary Components of VMIPS
• Vector registers: VMIPS has eight vector registers, and each holds 64 elements. Each vector register must have at least two read ports and one write port.
• Vector functional units: each unit is fully pipelined and can start a new operation on every clock cycle.
• Vector load-store unit: the VMIPS vector loads and stores are fully pipelined, so that words can be moved between the vector registers and memory with a bandwidth of 1 word per clock cycle, after an initial latency.
• A set of scalar registers: scalar registers can provide data as input to the vector functional units, as well as compute addresses to pass to the vector load-store unit.

Lecture 12, Slide 12: Vector Supercomputers
Epitomized by the Cray-1 (1976): a scalar unit plus vector extensions:
• Load/Store Architecture
• Vector Registers
• Vector Instructions
• Hardwired Control
• Highly Pipelined Functional Units
• Interleaved Memory System
• No Data Caches
• No Virtual Memory

Lecture 12, Slides 13-14: Cray-1 (1976)
[Figure: Cray-1 block diagram. Eight 64-element vector registers (V0-V7) with vector mask and vector length registers; eight scalar registers (S0-S7) backed by 64 T registers; eight address registers (A0-A7) backed by 64 B registers; pipelined functional units for FP add, FP multiply, FP reciprocal, integer add/logic/shift, population count, and address add/multiply; four instruction buffers (64 x 16 bits each). Single-port memory of 16 banks of 64-bit words with 8-bit SECDED; 80 MW/sec data load/store; 320 MW/sec instruction buffer refill; memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz).]

Lecture 12, Slide 15: Vector Programming Model
[Figure: the vector programming model. Scalar registers r0-r7; vector registers v0-v7, each holding elements [0] through [VLRMAX-1]; a vector length register (VLR). A vector arithmetic instruction such as ADDV v3, v1, v2 adds corresponding elements of v1 and v2 into v3. A vector load/store such as LV v1, r1, r2 moves VLR elements between memory and v1, using base address r1 and stride r2.]

Lecture 12, Slide 16
• In VMIPS, vector operations use the same names as MIPS operations, but with the letter "V" appended.

Lecture 12, Slide 17: Vector Code Example
# C code
    for (i=0; i<64; i++)
      C[i] = A[i] + B[i];
# Scalar code
          LI     R4, 64
    loop: L.D    F0, 0(R1)
          L.D    F2, 0(R2)
          ADD.D  F4, F2, F0
          S.D    F4, 0(R3)
          DADDIU R1, 8
          DADDIU R2, 8
          DADDIU R3, 8
          DSUBIU R4, 1
          BNEZ   R4, loop
# Vector code
          LI     VLR, 64
          LV     V1, R1
          LV     V2, R2
          ADDV.D V3, V1, V2
          SV     V3, R3

Lecture 12, Slide 18: Vector Instruction Set Advantages
• Compact: one short instruction encodes N operations
• Expressive: tells hardware that these N operations
  – are independent
  – use the same functional unit
  – access disjoint registers
  – access registers in the same pattern as previous instructions
  – access a contiguous block of memory (unit-stride load/store)
  – access memory in a known pattern (strided load/store)
• Scalable: can run the same object code on more parallel pipelines or lanes
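To pin down the semantics of the programming model above, here is a minimal C sketch of a vector register file and the LV/ADDV/SV operations under control of a vector length register. This is my illustration, not VMIPS itself; the names vreg, lv, addv, sv, and vlr are invented for the sketch.

    #include <stdio.h>

    #define MVL 64                            /* maximum vector length (VLRMAX) */

    typedef struct { double e[MVL]; } vreg;   /* one 64-element vector register */
    static int vlr = MVL;                     /* vector length register */

    /* LV: load vlr elements from memory (unit stride) into a vector register */
    static void lv(vreg *v, const double *mem) {
        for (int i = 0; i < vlr; i++) v->e[i] = mem[i];
    }

    /* ADDV.D: elementwise add of two vector registers */
    static void addv(vreg *d, const vreg *a, const vreg *b) {
        for (int i = 0; i < vlr; i++) d->e[i] = a->e[i] + b->e[i];
    }

    /* SV: store vlr elements of a vector register to memory */
    static void sv(const vreg *v, double *mem) {
        for (int i = 0; i < vlr; i++) mem[i] = v->e[i];
    }

    int main(void) {
        double A[MVL], B[MVL], C[MVL];
        for (int i = 0; i < MVL; i++) { A[i] = i; B[i] = 2 * i; }

        vreg v1, v2, v3;
        vlr = 64;             /* LI VLR, 64        */
        lv(&v1, A);           /* LV V1, R1         */
        lv(&v2, B);           /* LV V2, R2         */
        addv(&v3, &v1, &v2);  /* ADDV.D V3, V1, V2 */
        sv(&v3, C);           /* SV V3, R3         */

        printf("C[5] = %g\n", C[5]);   /* expect 15 */
        return 0;
    }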
Lecture 12, Slide 19: 3. How Vector Processors Work
3.1 An Example
• Let's take a typical vector problem, where X and Y are vectors and a is a scalar:
      Y = a × X + Y
• This is the so-called SAXPY or DAXPY loop that forms the inner loop of the Linpack benchmark.
• Example: Show the code for MIPS and VMIPS for the DAXPY loop.
• Assume that the starting addresses of X and Y are in Rx and Ry, and that the number of elements, or length, of a vector register (64) matches the length of the vector operation.

Lecture 12, Slide 20
• Here is the MIPS code:
          L.D    F0,a         ;load scalar a
          DADDIU R4,Rx,#512   ;last address to load
    Loop: L.D    F2,0(Rx)     ;load X(i)
          MUL.D  F2,F2,F0     ;a × X(i)
          L.D    F4,0(Ry)     ;load Y(i)
          ADD.D  F4,F4,F2     ;a × X(i) + Y(i)
          S.D    0(Ry),F4     ;store into Y(i)
          DADDIU Rx,Rx,#8     ;increment index to X
          DADDIU Ry,Ry,#8     ;increment index to Y
          DSUBU  R20,R4,Rx    ;compute bound
          BNEZ   R20,Loop     ;check if done
• Here is the VMIPS code for DAXPY:
          L.D     F0,a        ;load scalar a
          LV      V1,Rx       ;load vector X
          MULVS.D V2,V1,F0    ;vector-scalar multiply
          LV      V3,Ry       ;load vector Y
          ADDV.D  V4,V2,V3    ;add
          SV      Ry,V4       ;store the result

Lecture 12, Slide 21
• The most dramatic comparison is that the vector processor greatly reduces the dynamic instruction bandwidth.
• Another important difference is the frequency of pipeline interlocks: pipeline stalls are required only once per vector operation, rather than once per vector element.

Lecture 12, Slide 22: Vector Arithmetic Execution
• Use a deep pipeline (=> fast clock) to execute element operations.
• Control of the deep pipeline is simple because the elements in a vector are independent (=> no hazards!).
[Figure: a six-stage multiply pipeline computing V3 <- V1 * V2, with element pairs from V1 and V2 streaming in and results streaming out into V3.]

Lecture 12, Slide 23: 3.2 Vector Load-Store Units and Vector Memory Systems
Start-up penalties (in clock cycles) on VMIPS:
    Operation            Start-up penalty
    Vector add            6
    Vector multiply       7
    Vector divide        20
    Vector load/store    12
To maintain an initiation rate of 1 word fetched or stored per clock, the memory system must be capable of producing or accepting this much data. This is usually done by spreading accesses across multiple independent memory banks.

Lecture 12, Slide 24: Vector Memory System
Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency.
• Bank busy time: the number of cycles between accesses to the same bank.
[Figure: an address generator takes a base address and stride from the vector unit and distributes element addresses across 16 memory banks (0-F) that feed the vector registers.]

Lecture 12, Slide 25
• Example: Suppose we want to fetch a vector of 64 elements starting at byte address 136, and a memory access takes 6 clocks. How many memory banks must we have to support one fetch per clock cycle? With what addresses are the banks accessed? When will the various elements arrive at the CPU?
• Answer: Six clocks per access require at least six banks, but because we want the number of banks to be a power of two, we choose to have eight banks. The figure on the next slide shows the timing for the first few sets of accesses for an eight-bank system with a 6-clock-cycle access latency.

Lecture 12, Slide 26
[Figure: access timing for the first few groups of accesses in an eight-bank system with a 6-cycle access latency.]
The CPU cannot keep all eight banks busy all the time because it is limited to supplying one new address and receiving one data item each cycle.
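The following C sketch (mine, under the slide's assumptions: 8 banks, 8-byte elements, 6-cycle access latency, starting byte address 136, one new address issued per cycle) prints which bank each of the first few elements hits and the earliest cycle its data can arrive:

    #include <stdio.h>

    int main(void) {
        const long base    = 136;  /* starting byte address (slide's example) */
        const int  nbanks  = 8;    /* banks: a power of two >= 6              */
        const int  latency = 6;    /* clocks per memory access                */
        long next_free[8]  = {0};  /* earliest cycle each bank is free again  */

        /* Element i can be issued at cycle i at the earliest (one address
           per cycle), and also not before its bank is free. */
        for (int i = 0; i < 16; i++) {         /* first 16 of the 64 elements */
            long addr  = base + 8L * i;        /* 8-byte (64-bit) words       */
            int  bank  = (int)((addr / 8) % nbanks);
            long issue = i > next_free[bank] ? i : next_free[bank];
            next_free[bank] = issue + latency; /* bank busy until then        */
            printf("elem %2d: addr %3ld  bank %d  arrives at cycle %2ld\n",
                   i, addr, bank, issue + latency);
        }
        return 0;
    }

With eight banks and a 6-cycle latency, each bank is ready again before it is needed, so element i simply arrives at cycle i + 6; the banks, not the CPU's one-address-per-cycle limit, stop being the bottleneck, which is the point of the note on Slide 26.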
Lecture 12, Slide 27: 4. Two Real-World Issues: Vector Length and Stride
• What do you do when the vector length in a program is not exactly 64?
• How do you deal with nonadjacent elements in vectors that reside in memory?

Lecture 12, Slide 28: 4.1 Vector-Length Control
        do 10 i = 1,n
    10  Y(i) = a × X(i) + Y(i)
n may not even be known until run time.
• The solution is to create a vector-length register (VLR), which controls the length of any vector operation.
• The value in the VLR, however, cannot be greater than the length of the vector registers, the maximum vector length (MVL).
• If the vector is longer than the maximum length, a technique called strip mining is used.

Lecture 12, Slide 29: Vector Stripmining
Problem: vector registers have finite length.
Solution: break loops into pieces that fit into vector registers ("stripmining").
    for (i=0; i<N; i++)
      C[i] = A[i]+B[i];
          ANDI   R1, N, #63    ; N mod 64
          MTC1   VLR, R1       ; Do remainder
    loop: LV     V1, RA
          DSLL   R2, R1, #3    ; Multiply by 8
          DADDU  RA, RA, R2    ; Bump pointer
          LV     V2, RB
          DADDU  RB, RB, R2
          ADDV.D V3, V1, V2
          SV     V3, RC
          DADDU  RC, RC, R2
          DSUBU  N, N, R1      ; Subtract elements
          LI     R1, #64
          MTC1   VLR, R1       ; Reset full length
          BGTZ   N, loop       ; Any more to do?
[Figure: arrays A, B, and C processed as a remainder piece of N mod 64 elements followed by full 64-element pieces.]
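In C, the same stripmining transformation looks like the sketch below (my rendering, assuming MVL = 64 and applying it to the DAXPY loop from Slide 19): the first piece handles n mod 64 elements, and every later piece handles a full 64. The inner loop is what one pass of vector instructions, with VLR set to len, would execute.

    #include <stdio.h>

    #define MVL 64   /* maximum vector length */

    /* Strip-mined DAXPY: Y = a*X + Y for arbitrary n. */
    void daxpy_stripmined(int n, double a, const double *x, double *y) {
        int low = 0;
        int len = n % MVL;            /* odd-size piece first (may be 0) */
        while (low < n) {
            if (len == 0) len = MVL;
            for (int i = low; i < low + len; i++)  /* one vector "piece" */
                y[i] = a * x[i] + y[i];
            low += len;
            len = MVL;                /* all later pieces are full length */
        }
    }

    int main(void) {
        double x[200], y[200];
        for (int i = 0; i < 200; i++) { x[i] = 1.0; y[i] = 1.0; }
        daxpy_stripmined(200, 2.0, x, y);
        printf("y[0]=%g y[199]=%g\n", y[0], y[199]);   /* expect 3 and 3 */
        return 0;
    }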
Lecture 12, Slide 30: 4.2 Vector Stride
        do 10 i = 1,100
          do 10 j = 1,100
            A(i,j) = 0.0
            do 10 k = 1,100
    10      A(i,j) = A(i,j) + B(i,k)*C(k,j)
At the statement labeled 10 we could vectorize the multiplication of each row of B with each column of C.
When an array is allocated memory, it is linearized and must be laid out in either row-major or column-major order. This linearization means that either the elements in the row or the elements in the column are not adjacent in memory.

Lecture 12, Slide 31: Vector Stride
The distance separating elements that are to be gathered into a single register is called the stride.
• The vector stride, like the vector starting address, can be put in a general-purpose register.
• Then the VMIPS instruction LVWS (load vector with stride) can be used to fetch the vector into a vector register.
• Likewise, when a nonunit-stride vector is being stored, SVWS (store vector with stride) can be used.

Lecture 12, Slide 32: 5. Effectiveness of Compiler Vectorization
• Two factors affect the success with which a program can be run in vector mode.
• The first factor is the structure of the program itself. This factor is influenced by the algorithms chosen and by how they are coded.
• The second factor is the capability of the compiler.

Lecture 12, Slide 33: Automatic Code Vectorization
    for (i=0; i < N; i++)
      C[i] = A[i] + B[i];
[Figure: scalar sequential code executes load, load, add, store once per iteration, in iteration order; vectorized code reorders the same operations across iterations, issuing all the loads for a vector together, then the adds, then the stores, each as one vector instruction.]
Vectorization is a massive compile-time reordering of operation sequencing and requires extensive loop dependence analysis.

Lecture 12, Slide 34: 6. Enhancing Vector Performance
In this section we present five techniques for improving the performance of a vector processor:
• Chaining
• Conditionally Executed Statements
• Sparse Matrices
• Multiple Lanes
• Pipelined Instruction Start-Up

Lecture 12, Slide 35: (1) Vector Chaining
• The concept of forwarding extended to vector registers
• The vector version of register bypassing; introduced with the Cray-1
[Figure: for the sequence LV V1; MULV V3,V1,V2; ADDV V5,V3,V4, elements are chained from the load unit into the multiplier and from the multiplier into the adder as each individual element completes.]

Lecture 12, Slide 36: Vector Chaining Advantage
• Without chaining, a dependent instruction must wait for the last element of the result to be written before it can start.
• With chaining, a dependent instruction can start as soon as the first result appears.
[Figure: timelines for Load, Mul, and Add with and without chaining; chaining overlaps the three operations instead of running them back to back.]

Lecture 12, Slide 37: Implementations of Chaining
• Early implementations worked like forwarding, but this restricted the timing of the source and destination instructions in the chain.
• Recent implementations use flexible chaining, which requires simultaneous access to the same vector register by different vector instructions. This can be implemented either by adding more read and write ports or by organizing the vector-register file storage into interleaved banks, in a similar way to the memory system.

Lecture 12, Slide 38: (2) Vector Conditional Execution
Problem: we want to vectorize loops with conditional code:
    for (i=0; i<N; i++)
      if (A[i] > 0) A[i] = B[i];
Solution: add vector mask (or flag) registers
  – the vector version of predicate registers, 1 bit per element
...and maskable vector instructions
  – a vector operation becomes a NOP at elements where the mask bit is clear
Code example:
          CVM                  ; Turn on all elements
          LV      VA, RA       ; Load entire A vector
          L.D     F0, #0       ; Load FP zero into F0
          SGTVS.D VA, F0       ; Set bits in mask register where A>0
          LV      VA, RB       ; Load B vector into A under mask
          SV      VA, RA       ; Store A back to memory under mask
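Here is a minimal C sketch (my illustration, not VMIPS) of what the mask machinery computes: a compare sets one mask bit per element, and a masked operation only touches elements whose bit is set. The helper names set_mask_gt and masked_copy are invented for the sketch.

    #include <stdint.h>
    #include <stdio.h>

    #define MVL 64

    /* SGTVS-style compare: set mask bit i where a[i] > s */
    static uint64_t set_mask_gt(const double *a, double s, int vlr) {
        uint64_t m = 0;
        for (int i = 0; i < vlr; i++)
            if (a[i] > s) m |= (uint64_t)1 << i;
        return m;
    }

    /* Masked copy: a[i] = b[i] only where the mask bit is set
       (the combined effect of LV/SV under mask in the example above). */
    static void masked_copy(double *a, const double *b, uint64_t m, int vlr) {
        for (int i = 0; i < vlr; i++)
            if ((m >> i) & 1) a[i] = b[i];
    }

    int main(void) {
        double A[MVL], B[MVL];
        for (int i = 0; i < MVL; i++) { A[i] = (i % 2) ? 1.0 : -1.0; B[i] = 7.0; }

        uint64_t mask = set_mask_gt(A, 0.0, MVL);  /* where A[i] > 0       */
        masked_copy(A, B, mask, MVL);              /* A[i] = B[i] under mask */

        printf("A[0]=%g A[1]=%g\n", A[0], A[1]);   /* expect -1 and 7 */
        return 0;
    }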
Lecture 12, Slide 39: Masked Vector Instructions
Simple implementation: execute all N operations, and turn off result writeback according to the mask.
Density-time implementation: scan the mask vector and only execute elements whose mask bits are set.
[Figure: in the simple implementation, every element pair A[i], B[i] flows through the pipeline and the mask bit M[i] gates the write enable; in the density-time implementation, only the element pairs with M[i]=1 are sent down the pipeline to the write data port.]

Lecture 12, Slide 40: Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register.
  – The population count of the mask vector gives the packed vector length.
• Expand performs the inverse operation.
[Figure: with mask bits set at positions 1, 4, 5, and 7, compress gathers A[1], A[4], A[5], A[7] to the front of the destination register; expand scatters a packed vector back to those masked positions.]
Used for density-time conditionals and also for general selection operations.

Lecture 12, Slide 41: (3) Sparse Matrices

Lecture 12, Slide 42: Vector Scatter/Gather
We want to vectorize loops with indirect accesses (the index vector D designates the nonzero elements of C):
    for (i=0; i<N; i++)
      A[i] = B[i] + C[D[i]]
Indexed load instruction (gather):
    LV     VD, RD        ; Load indices in D vector
    LVI    VC, (RC, VD)  ; Load indirect from RC base
    LV     VB, RB        ; Load B vector
    ADDV.D VA, VB, VC    ; Do add
    SV     VA, RA        ; Store result

Lecture 12, Slide 43: Vector Scatter/Gather
Scatter example:
    for (i=0; i<N; i++)
      A[B[i]]++;
Is the following a correct translation?
    LV   VB, RB          ; Load indices in B vector
    LVI  VA, (RA, VB)    ; Gather initial A values
    ADDV VA, VA, 1       ; Increment
    SVI  VA, (RA, VB)    ; Scatter incremented values
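The slide leaves the question open; the standard answer is that the translation is correct only if the index vector B contains no repeated values. If some index appears twice, both gathered copies see the old value, each is incremented once, and one scattered result overwrites the other, so A[B[i]] goes up by 1 instead of 2. The C sketch below (mine) demonstrates the hazard, assuming one strip of at most 64 elements:

    #include <stdio.h>

    /* Gather-modify-scatter translation of: for (i) A[B[i]]++; */
    static void scatter_increment(double *A, const int *B, int n) {
        double VA[64];                                /* assume n <= 64 (one strip) */
        for (int i = 0; i < n; i++) VA[i] = A[B[i]];  /* LVI: gather   */
        for (int i = 0; i < n; i++) VA[i] += 1.0;     /* ADDV          */
        for (int i = 0; i < n; i++) A[B[i]] = VA[i];  /* SVI: scatter  */
    }

    int main(void) {
        double A[4] = {0, 0, 0, 0};
        int B[3] = {2, 2, 1};            /* index 2 repeats! */
        scatter_increment(A, B, 3);
        /* The correct result would be A[2] == 2, but both gathered
           copies saw 0, so the scattered value is only 1. */
        printf("A[1]=%g A[2]=%g\n", A[1], A[2]);   /* prints 1 and 1 */
        return 0;
    }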
Lecture 12, Slide 44: (4) Multiple Lanes
[Figure: execution of ADDV C,A,B. With one pipelined functional unit, one element pair (A[i], B[i]) enters the adder per cycle. With four pipelined functional units (lanes), four element pairs enter per cycle: elements 0, 4, 8, ... go to lane 0, elements 1, 5, 9, ... to lane 1, and so on.]

Lecture 12, Slide 45: Vector Unit Structure
[Figure: a four-lane vector unit. Each lane holds a vertical slice of every vector register (elements 0, 4, 8, ... in lane 0; elements 1, 5, 9, ... in lane 1; and so on), a slice of each functional unit, and its own port to the memory subsystem.]

Lecture 12, Slide 46: T0 Vector Microprocessor (1995)
[Figure: the T0 vector microprocessor. Vector register elements are striped over its eight lanes, so lane 0 holds elements 0, 8, 16, 24, ... and lane 7 holds elements 7, 15, 23, 31, ...]

Lecture 12, Slide 47: Vector Instruction Parallelism
Execution of multiple vector instructions can overlap. Example machine: 32 elements per vector register and 8 lanes.
[Figure: the load unit, multiply unit, and add unit each work on a different vector instruction at the same time; with 8 lanes, each unit completes 8 operations per cycle.]
The machine completes 24 operations/cycle while issuing 1 short instruction/cycle.

Lecture 12, Slide 48: (5) Pipelined Instruction Start-Up
• The simplest case to consider is when two vector instructions access a different set of vector registers. For example, consider the code sequence
    ADDV.D V1,V2,V3
    ADDV.D V4,V5,V6
• It becomes critical to reduce start-up overhead by allowing the start of one vector instruction to be overlapped with the completion of preceding vector instructions.
• An implementation can allow the first element of the second vector instruction to immediately follow the last element of the first vector instruction down the FP adder pipeline.

Lecture 12, Slide 49: Vector Startup
Two components of the vector startup penalty:
  – functional unit latency (time through the pipeline)
  – dead time or recovery time (time before another vector instruction can start down the pipeline)
[Figure: pipeline diagram with R, X, X, X, W stages showing the elements of a first vector instruction flowing through, a dead-time gap, and then the second vector instruction entering the pipeline.]

Lecture 12, Slide 50: Dead Time and Short Vectors
[Figure: with no dead time, back-to-back vector instructions keep the pipeline fully busy; with 4 cycles of dead time after each 64 active cycles, throughput is lost, and the loss is worse for short vectors.]
• Cray C90, two lanes: 4-cycle dead time; maximum efficiency 94% with 128-element vectors.
• T0, eight lanes: no dead time; 100% efficiency with 8-element vectors.

Lecture 12, Slide 51
• Example: The Cray C90 has two lanes but requires 4 clock cycles of dead time between any two vector instructions to the same functional unit. For the maximum vector length of 128 elements, what is the reduction in achievable peak performance caused by the dead time? What would be the reduction if the number of lanes were increased to 16?
• Answer: A maximum-length vector of 128 elements is divided over the two lanes and occupies a vector functional unit for 64 clock cycles. The dead time adds another 4 cycles of occupancy, reducing the peak performance to 64/(64 + 4) = 94.1% of the value without dead time. If the number of lanes is increased to 16, maximum-length vector instructions will occupy a functional unit for only 128/16 = 8 cycles, and the dead time will reduce peak performance to 8/(8 + 4) = 66.6% of the value without dead time.

Lecture 12, Slide 52: 7. Performance of Vector Processors
Vector Execution Time
The execution time of a sequence of vector operations primarily depends on three factors:
• the length of the operand vectors
• structural hazards among the operations
• data dependences

Lecture 12, Slide 53: Convoy and Chime
• A convoy is the set of vector instructions that could potentially begin execution together in one clock period.
  – The instructions in a convoy must not contain any structural or data hazards; if such hazards were present, the instructions in the potential convoy would need to be serialized and initiated in different convoys.
• A chime is the unit of time taken to execute one convoy.
  – A chime is an approximate measure of execution time for a vector sequence; a chime measurement is independent of vector length.
  – A vector sequence that consists of m convoys executes in m chimes, and for a vector length of n, this is approximately m × n clock cycles.

Lecture 12, Slide 54
• Example: Show how the following code sequence lays out in convoys, assuming a single copy of each vector functional unit:
    LV      V1,Rx     ;load vector X
    MULVS.D V2,V1,F0  ;vector-scalar multiply
    LV      V3,Ry     ;load vector Y
    ADDV.D  V4,V2,V3  ;add
    SV      Ry,V4     ;store the result
• How many chimes will this vector sequence take?
• How many cycles per FLOP (floating-point operation) are needed, ignoring vector instruction issue overhead?

Lecture 12, Slide 55
• Answer: The first convoy is occupied by the first LV instruction. The MULVS.D is dependent on the first LV, so it cannot be in the same convoy. The second LV instruction can be in the same convoy as the MULVS.D. The ADDV.D is dependent on the second LV, so it must come in yet a third convoy, and finally the SV depends on the ADDV.D, so it must go in a fourth convoy.
    1. LV
    2. MULVS.D  LV
    3. ADDV.D
    4. SV
• The sequence requires four convoys and hence takes four chimes. Since the sequence takes a total of four chimes and there are two floating-point operations per result, the number of cycles per FLOP is 2 (ignoring any vector instruction issue overhead).

Lecture 12, Slide 56: Start-up Overhead
• The most important source of overhead ignored by the chime model is vector start-up time.
• The start-up time comes from the pipelining latency of the vector operation and is principally determined by how deep the pipeline is for the functional unit used.
    Unit                 Start-up overhead (cycles)
    Load and store unit  12
    Multiply unit         7
    Add unit              6

Lecture 12, Slide 57
• Example: Assume the start-up overhead for functional units is as shown in the table on the previous slide. Show the time that each convoy can begin and the total number of cycles needed. How does the time compare to the chime approximation for a vector of length 64?
• Answer: The time per result for a vector of length 64 is 4 + (42/64) = 4.65 clock cycles, while the chime approximation would be 4.
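A small C sketch (mine, following the text's convoy accounting) that reproduces the 4 + 42/64 figure: each convoy starts when the previous one finishes, and a convoy of vector length n through a unit with start-up s completes s + n cycles after it begins. The 42-cycle total overhead is 12 + 12 + 6 + 12, taking the largest start-up in each convoy (the second convoy holds both MULVS.D, 7 cycles, and LV, 12 cycles).

    #include <stdio.h>

    int main(void) {
        int n = 64;                       /* vector length */
        /* Dominant start-up latency of each convoy:
           1. LV (12)            2. MULVS.D + LV (max(7,12) = 12)
           3. ADDV.D (6)         4. SV (12) */
        int startup[4] = {12, 12, 6, 12};

        int begin = 0, done = 0;
        for (int c = 0; c < 4; c++) {
            begin = done;                 /* convoy starts when previous ends */
            done  = begin + startup[c] + n;
            printf("convoy %d: begins %3d, completes %3d\n", c + 1, begin, done);
        }
        printf("total = %d cycles, %.3f per result (chime approx = 4)\n",
               done, (double)done / n);
        /* prints 298 cycles and 4.656 per result; the slide quotes
           4 + 42/64, rounded to 4.65 */
        return 0;
    }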
Lecture 12, Slide 58: Running Time of a Strip-mined Loop
There are two key factors that contribute to the running time of a strip-mined loop consisting of a sequence of convoys:
1. The number of convoys in the loop, which determines the number of chimes. We use the notation Tchime for the execution time in chimes.
2. The overhead for each strip-mined sequence of convoys. This overhead consists of the cost of executing the scalar code for strip-mining each block, Tloop, plus the vector start-up cost for each convoy, Tstart.
• The total running time for a vector sequence operating on a vector of length n is
      Tn = ⌈n/MVL⌉ × (Tloop + Tstart) + n × Tchime

Lecture 12, Slide 59
• Example: What is the execution time on VMIPS for the vector operation A = B × s, where s is a scalar and the length of the vectors A and B is 200?
• Answer:
  – Assume the addresses of A and B are initially in Ra and Rb, s is in Fs, and recall that for MIPS (and VMIPS) R0 always holds 0.
  – The first iteration of the strip-mined loop will execute for a vector length of (200 mod 64) = 8 elements, and the following iterations will execute for a vector length of 64 elements.
  – Since the vector length is either 8 or 64, we increment the address registers by 8 × 8 = 64 after the first segment and 8 × 64 = 512 for later segments. The total number of bytes in the vector is 8 × 200 = 1600, and we test for completion by comparing the address of the next vector segment to the initial address plus 1600.

Lecture 12, Slide 60: Here is the actual code:
          DADDUI  R2,R0,#1600   ;total # bytes in vector
          DADDU   R2,R2,Ra      ;address of the end of A vector
          DADDUI  R1,R0,#8      ;loads length of 1st segment
          MTC1    VLR,R1        ;load vector length in VLR
          DADDUI  R1,R0,#64     ;length in bytes of 1st segment
          DADDUI  R3,R0,#64     ;vector length of other segments
    Loop: LV      V1,Rb         ;load B
          MULVS.D V2,V1,Fs      ;vector * scalar
          SV      Ra,V2         ;store A
          DADDU   Ra,Ra,R1      ;address of next segment of A
          DADDU   Rb,Rb,R1      ;address of next segment of B
          DADDUI  R1,R0,#512    ;load byte offset of next segment
          MTC1    VLR,R3        ;set length to 64 elements
          DSUBU   R4,R2,Ra      ;at the end of A?
          BNEZ    R4,Loop       ;if not, go back

Lecture 12, Slide 61
• The three vector instructions in the loop are dependent and must go into three convoys, hence Tchime = 3.
• Using our basic formula, with a strip-mining overhead of Tloop = 15:
      T200 = ⌈200/64⌉ × (Tloop + Tstart) + 200 × 3 = 4 × (15 + Tstart) + 600 = 660 + 4 × Tstart
• The value of Tstart is the sum of the load, multiply, and store start-up times: Tstart = 12 + 7 + 12 = 31.
• So the overall value becomes T200 = 660 + 4 × 31 = 784.
• The execution time per element with all start-up costs is then 784/200 = 3.9, compared with a chime approximation of three.
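Finally, a tiny C sketch (mine) of this running-time model, reproducing the T200 = 784 result under the slide's constants (MVL = 64, Tloop = 15, Tstart = 31, Tchime = 3):

    #include <stdio.h>

    /* Total cycles for a strip-mined vector sequence of length n:
       Tn = ceil(n/MVL) * (Tloop + Tstart) + n * Tchime */
    static long vector_time(long n, int mvl, int tloop, int tstart, int tchime) {
        long blocks = (n + mvl - 1) / mvl;     /* ceil(n / MVL) */
        return blocks * (tloop + tstart) + n * tchime;
    }

    int main(void) {
        long t = vector_time(200, 64, 15, 31, 3);
        printf("T200 = %ld cycles, %.2f cycles/element\n", t, (double)t / 200);
        /* prints: T200 = 784 cycles, 3.92 cycles/element */
        return 0;
    }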