SYSTEM PERFORMANCE ANALYSIS AND OPTIMIZATION (Ch. 9 from Laplante, 1997)

Recall that response time is the time between receipt of an interrupt and completion of all associated processing. Time-loading (utilization) is the percentage of time the CPU spends doing "useful" processing. Memory-loading is the percentage of usable memory that is in use.

RESPONSE TIME CALCULATION

In general, the response time for task i, denoted Ri, is

Ri = Li + Cs + Si + Ai

where Li is the interrupt latency (nanoseconds), Cs is the context save time (microseconds), Si is the schedule time (microseconds), and Ai is the actual processing time (milliseconds).

For the highest-priority task, the total interrupt latency can be computed as

Li = Lp + max(LI, LD)

where Lp is the interrupt latency due to the propagation delay of the interrupt signal, LI is the longest completion time of any instruction in the interrupted process, and LD is the maximum time that interrupts are deliberately disabled by a lower-priority routine, for example during context switching or buffer passing.

For a lower-priority task, the interrupt cannot be processed until all higher-priority routines have completed. In this case Li = LH, where LH is the time needed to complete all higher-priority routines. Calculating LH is difficult or impossible for most systems, because process i itself might be interrupted.

TIME-LOADING AND ITS MEASUREMENT

If Ti is the cycle time (the minimum time between occurrences) of task i, and Ai is its actual execution time, then the time-loading (utilization) T for n tasks is

T = A1/T1 + A2/T2 + ... + An/Tn  (the sum of Ai/Ti over i = 1..n)

SCHEDULING PROBLEMS AND ESTIMATIONS

Most scheduling problems involving real-time systems are NP-complete (believed to require exponential time to solve):

1. When there are mutual exclusion constraints, it is impossible to find a totally on-line optimal run-time scheduler.
2.
The problem of deciding whether it is possible to schedule a set of periodic tasks that use semaphores only to enforce mutual exclusion is NP-hard (no exponential-time solution is even known).
3. The multiprocessor scheduling problem with two processors, no resources, independent tasks, and arbitrary computation times is NP-complete (for unit computation times it is polynomial).
4. The multiprocessor scheduling problem with three or more processors, one resource, independent tasks, and unit computation time for each task is NP-complete.

REDUCING RESPONSE TIMES AND TIME-LOADING

1. Compute at the Slowest Cycle
All processing should be done at the slowest rate that can be tolerated. For example, checking a temperature discrete for a large room more often than once per second is probably wasteful.

2. Scaled Arithmetic
Integer operations are typically faster than floating-point operations on most computers. We can take advantage of this fact in certain systems by scaling integers to simulate floating-point operations. This was one of the first methods for implementing real-number arithmetic on early computers. Here a two's complement integer is used, whose LSB (least significant bit) is assigned a scale factor, sometimes called the granularity of the number. If the number is an n-bit two's complement integer, then the MSB (most significant bit) acts as a sign bit. The largest number that can be represented this way is (2^(n-1) - 1)*LSB, and the smallest is -2^(n-1)*LSB.

Example: Consider an aircraft navigation system in which x, y, and z accelerometer pulses are converted into actual accelerations by applying a scale factor of 0.01. The 16-bit number 0000 0000 0001 0011 (decimal 19) then represents a delta velocity of 19*0.01 = 0.19 feet per second. The largest and smallest delta velocities that can be represented in this scheme are 327.67 and -327.68 feet per second, respectively.
Scaled numbers can be added to and subtracted from one another, and multiplied or divided by a constant (but not by another scaled number), as signed integers. Thus computations involving such numbers can be performed in integer form and converted to floating point only at the last step.

3. Look-Up Tables
Look-up tables rely on the mathematical definition of the derivative:

f'(x) = limit as dx -> 0 of [f(x + dx) - f(x)] / dx

A generic look-up table is an array of precomputed values of f for various x taken with step dx. Intermediate values can be interpolated as follows: for x' between tabulated points x and x + dx,

f(x') ~= f(x) + (x' - x) * [f(x + dx) - f(x)] / dx

The choice of dx represents a tradeoff between the size of the table and the desired resolution of the function.

BASIC OPTIMIZATION THEORY

1. Use of Arithmetic Identities
For example, multiplication by the constant 1 or addition of 0 should be eliminated from the executable code.

2. Reduction in Strength
This method refers to using the fastest macroinstruction available to accomplish a given calculation. For example, many compilers will replace multiplication of an integer by a power of 2 with shift instructions. Divide instructions usually take longer to execute than multiply instructions, so it may be better to multiply by the reciprocal of a number than to divide by it. For example, x*0.5 will be faster than x/2.0.

3. Common Sub-Expression Elimination
The following Pascal fragment
x := y + a*b;
y := a*b + z;
could be replaced with
t := a*b;
x := y + t;
y := t + z;
eliminating the second multiplication.

4. Intrinsic Functions
When possible, use intrinsic functions rather than ordinary functions. Intrinsic functions are simply macros where the actual function call is replaced by in-line code during compilation:
#define max(A,B) ((A)>(B)?(A):(B))
This improves real-time performance because the need to pass parameters, create space for local variables, and release that space is eliminated.

5.
Constant Folding
The statement
x := 2.0*x*4.0;
could be optimized by folding 2.0*4.0 into 8.0:
x := x*8.0;
Although the original statement may be more descriptive, a comment can be provided to explain the optimized expression. Mnemonic names can also be used; for example, pi/2 can be precomputed and stored as a constant named pi_div_2.

6. Loop Invariant Optimization
Consider the following Pascal fragment:
x := 100;
while x > 0 do
  x := x - (y + z);
It can be replaced by
x := 100;
t := y + z;
while x > 0 do
  x := x - t;
This moves an instruction outside the loop (decreasing execution time) but increases memory requirements.

7. Loop Induction Elimination
A variable i is called an induction variable of a loop if every time the loop variable changes, i is incremented or decremented by some constant. Consider the following Pascal fragment:
for i := 1 to 10 do
  a[i+1] := 1;
An improved version is
for j := 2 to 11 do
  a[j] := 1;
eliminating the extra addition within the loop.

8. Use of Registers and Caches
When programming in assembly language, or when using languages that support register-type variables, such as C, it is usually advantageous to perform calculations in registers:
f(register unsigned m, register long n)
{
  register int i;
  ...
}
Although most optimizing compilers will cache variables where possible, the nature of the source-level code affects the compiler's ability to do so.

9. Removal of Dead or Unreachable Code
For example, instead of
if (debug) {
  ...
}
it is better to use
#ifdef DEBUG
  ...
#endif
so that the code is removed entirely when it is not needed.

10. Flow Control Optimization
The following pseudocode
goto label11;
label10: y = 1;
label11: goto label12;
can be replaced by
goto label12;
label10: y = 1;
label11: goto label12;
Such code is not normally written by programmers but might result from automatic code generation.

11. Constant Propagation
Certain variable assignments can be changed to constant assignments, permitting time savings.
For example,
x := 100;
y := x;
is implemented in two-address assembly by a non-optimizing compiler as
LOAD R1,100
STORE R1,x
LOAD R1,x
STORE R1,y
This could be replaced by
x := 100;
y := 100;
with the associated two-address assembly output:
LOAD R1,100
STORE R1,x
STORE R1,y

12. Dead-Store Elimination
The following Pascal code illustrates a dead store:
t := y + z;
x := func(t);
This could be replaced by
x := func(y + z);
if t is not used in any other statement.

13. Dead Variable Elimination
A variable is live at a point in a program if its value can be used subsequently; otherwise it is dead and subject to removal. The following C code illustrates that x is a dead variable:
int main() {
  int x;
  return 1;
}
After dead variable elimination:
int main() {
  return 1;
}

14. Short-Circuiting Boolean Code
In
if (x > 0) and (y > 0) then
  z := 1;
there is no need to check y > 0 when x <= 0:
if x > 0 then
  if y > 0 then
    z := 1;

15. Loop Unrolling
Loop unrolling duplicates statements in order to reduce the number of loop iterations and hence the loop overhead incurred:
for i := 1 to 6 do
  a[i] := a[i]*8;
may be replaced by
for i := 1 to 6 step 3 do
begin
  a[i] := a[i]*8;
  a[i+1] := a[i+1]*8;
  a[i+2] := a[i+2]*8
end;

16. Loop Jamming
Also called loop fusion, this combines similar loops into one:
for i := 1 to 100 do
  x[i] := y[i]*8;
for i := 1 to 100 do
  z[i] := x[i]*y[i];
is converted to
for i := 1 to 100 do
begin
  x[i] := y[i]*8;
  z[i] := x[i]*y[i]
end;

17. Cross Jump Elimination
If the same code appears in more than one arm of a case statement, those arms can be combined:
case x of
  0: x := x + 1;
  1: x := x*2;
  2: x := x + 1;
  3: x := 2;
end;
can be replaced by
case x of
  0, 2: x := x + 1;
  1: x := x*2;
  3: x := 2;
end;

OTHER OPTIMIZATION TECHNIQUES

1. Optimize the most frequently used path.
2. Arrange a series of IF statements so that the condition most likely to fail is tested first.
3.
Arrange a series of AND conditions so that the condition most likely to fail is tested first (for OR conditions, so that the condition most likely to succeed is tested first).
4. Arrange entries in a table so that the most frequently sought values are the first to be compared.
5. Replace threshold tests on monotone (continuously nondecreasing or nonincreasing) functions by tests on their parameters. For example, instead of
if exp(x) < exp(y) then ...
use
if x < y then ...
6. Link the most frequently used procedures together to maximize locality of reference (for paged and cached systems).
7. Store data elements that are used together near one another (to increase locality of reference).
8. Store procedures in sequence so that calling and called procedures will be loaded together (to increase locality of reference).

ANALYSIS OF MEMORY REQUIREMENTS

Memory is considered as program, RAM, and stack areas. The total memory-loading is a weighted sum of the individual memory-loadings for the program, RAM, and stack areas:

MT = MP*PP + MR*PR + MS*PS

where MP, MR, MS are the memory-loadings of the program, RAM, and stack areas, and PP, PR, PS are the percentages of total memory allocated to the program, RAM, and stack areas.

For example, suppose a computer system has 64 Mb of program memory loaded at 75%, 24 Mb of RAM loaded at 25%, and 12 Mb of stack area loaded at 50%. Total memory is 100 Mb, so the total memory-loading is

MT = 0.75*(64/100) + 0.25*(24/100) + 0.50*(12/100) = 0.48 + 0.06 + 0.06 = 60%

The individual loadings are computed as follows.

MP = UP/TP, where UP is the number of locations used in the program area and TP is the total number of available locations in the program area.

MR = UR/TR, where UR is the number of locations used in the RAM area and TR is the total number of available locations in RAM. The numbers UP, TP, UR, TR are available from the linker.

MS = US/TS, where TS is the total number of available locations in the stack area, and US = CS*Tmax, where CS is the number of locations allocated for one task and Tmax is the maximum number of tasks that can reside simultaneously in the stack area.
REDUCING MEMORY-LOADING

This may be achieved by a proper choice of target area for variables, by reuse of variables, and by use of self-modifying code (which is dangerous and not allowed in many cases).

QUEUEING MODELS

Basic Buffer Size Calculation
If data are produced at rate P(t) and consumed at rate C(t), and a burst of data lasts for a period T, then the required buffer size is

B = (P - C)*T

if P and C are constants. If they are functions of time and the burst takes place between t1 and t2 (so T = t2 - t1), then the buffer size is

B = max over such bursts of the integral from t1 to t2 of (P(t) - C(t)) dt

If the rates of production and consumption are random variables with some distribution, we arrive at the queueing model X/Y/n, where X denotes the arrival time probability distribution, Y the service time probability distribution, and n the number of servers. For example, n may be the number of processors, X the distribution of times between arriving interrupts, and Y the distribution of the time taken to handle an interrupt by the corresponding process. Thus M/M/1 denotes a system with one processor serving interrupts that arrive according to an exponential distribution and that require exponentially distributed service times.

The exponential distribution is given by

f(t) = lambda * e^(-lambda*t)

This means that the probability that a new interrupt arrives at a time instant T inside the interval (a, b) is

p(a <= T <= b) = integral from a to b of f(t) dt = e^(-lambda*a) - e^(-lambda*b)

The value 1/lambda gives the average time between two consecutive interrupts; lambda is the average arrival rate of interrupts. The corresponding mean service time is denoted 1/mu. To avoid an unbounded number of interrupts in the queue we require

lambda < mu

that is, the mean arrival rate of interrupts must be less than the rate of serving them.
If we denote rho = lambda/mu, then the average number of customers (interrupts) in the system is

N = rho/(1 - rho)   (1)

with variance

sigma_N^2 = rho/(1 - rho)^2   (2)

(Recall that for a random variable x with density f(x), the average value is x_bar = integral of x*f(x) dx and the variance is sigma_x^2 = integral of (x - x_bar)^2 * f(x) dx.)

The average time a customer (interrupt) spends in the system is

T = (1/mu)/(1 - rho) = 1/(mu - lambda)   (3)

The probability that at least k customers are in the queue is

P(number of customers >= k) = rho^k   (4)

The buffer size for waiting interrupts is determined by the average queue length N in (1); with the help of (1) and (2) we can decide what maximum buffer size should be used. Expression (3) allows us to evaluate response times. Expression (4) can be used to choose system parameters so that there will be no more than a specified number of pending requests (with high probability).

LITTLE'S LAW

Little's law (which appeared in 1961) states that the average number of customers in a queueing system, N_av, equals the average arrival rate of customers to that system, r_av, times the average time each spends in the system, t_av:

N_av = r_av * t_av

If n servers are present, then

N_av = sum over i = 1..n of r_i,av * t_i,av

where r_i,av is the average arrival rate of customers to the i-th server and t_i,av is the average service time of server i.

For example, a system is known to have periodic interrupts occurring every 10, 20, and 100 milliseconds, and a sporadic interrupt that occurs on average once every 1 second (1000 milliseconds). The average processing times for these interrupts are 3, 8, 25, and 30 milliseconds, respectively. By Little's law, the average number of customers in the queue is

N_av = 3/10 + 8/20 + 25/100 + 30/1000 = 0.3 + 0.4 + 0.25 + 0.03 = 0.98