Deferred segment-loading
An exercise on implementing the concept of ‘load-on-demand’

The ‘do-it-later’ philosophy
• Modern operating systems often follow a policy of deferring work whenever possible
• The advantage of adopting this practice is most evident in those cases where it turns out that the work was not needed after all
• Example: Many programs contain lots of code and data for diagnosing errors, but it’s not needed if no errors actually occur

Avoiding wasted effort
• Thus it will be more efficient if an OS does not always take time to load those portions of a program (such as its error-diagnostics and error-recovery routines) which may be unnecessary in the majority of situations
• But of course the OS needs to be ready to take a ‘timeout’ for loading those routines when and if the need becomes apparent

Another example
• In a multitasking environment, many tasks are taking turns at executing instructions
• The CPU typically performs task-switching several times every second, and must do a ‘save’ of the outgoing task’s context, and a ‘load’ of the incoming task’s context, any time it switches from one task to the next
• We ask: can any of this work be deferred?

The NPX registers
• Only a few tasks typically make any use of the Pentium’s ‘floating-point’ registers, so it’s wasteful to do a ‘save-and-reload’ for these registers with every task-switch
• The TS-bit (bit #3 in Control Register 0) is designed to assist an OS in implementing a policy of ‘lazy’ context-switching for the set of registers used in floating-point work

Example: effect of TS=1
• Each time the CPU performs a task-switch it automatically sets the TS-bit to 1 (only an OS can execute a ‘clts’ to reset TS=0)
• When any task tries to execute any of the NPX instructions (to do some arithmetic with values in the floating-point registers), an exception-7 fault will occur if the TS-bit hasn’t been cleared since a task-switch

The fault-7 exception-handler
• The work involved in saving the contents of the floating-point registers being used by a no-longer-active task, and reloading those registers with values that the active task expects to work on, can be deferred to the fault-handler for exception-7
• The handler can then clear the TS-bit (with ‘clts’) and ‘retry’ the instruction that caused this ‘fault’

The ‘fork()’ system-call
• In a UNIX/Linux operating system, the way any new task gets created is by a call to the kernel’s ‘fork()’ service-function
• This function is supposed to ‘duplicate’ the entire program-environment of the calling task (i.e., code, data, stack and heap, plus the kernel’s process-control data-structure)
• But much of this work is often wasted!

The ‘fork-and-exec’ scenario
• In practice, the most common reason for a program to ‘fork()’ a child-process is so the child-task can launch a separate program:
      if ( fork() == 0 ) execl( "newprog", newargs, 0 );
• In these cases the ‘duplicated’ code, data, and heap are not relevant to the new task, so they will simply get discarded!

‘Loading-on-demand’
• An OS can avoid all the wasted effort of duplicating a parent-task’s resources (its code, data, heap, etc.) by implementing “only upon demand” loading as a policy
• For an OS that uses the CPU’s memory-segmentation capabilities, an ‘on demand’ policy can be implemented by using the Pentium ‘Segment-Not-Present’ exception

How it works
• Segments remain ‘uninitialized’ until they are actually accessed by an application
• Segment-descriptors are initially marked as ‘Not Present’ (i.e., their P-bit is zero)
• When any instruction attempts to access such a memory-segment (read, write, or fetch), the CPU responds by generating exception-11: “Segment-Not-Present”

An ‘error-code’ is pushed
• Besides pushing the memory-address of the faulting instruction onto the exception-handler’s stack, the CPU also pushes an ‘error-code’ to indicate which descriptor was not yet marked as being ‘Present’
• The handler can then ‘load’ that segment with the proper information and adjust its descriptor’s P-bit, then retry the instruction

Error-Code Format

     31            16 15                3    2     1     0
    +----------------+-------------------+-----+-----+-----+
    |    reserved    |    table-index    | TI  | IDT | EXT |
    +----------------+-------------------+-----+-----+-----+

Legend:
EXT = An external event caused the exception (1=yes, 0=no)
IDT = table-index refers to Interrupt Descriptor Table (1=yes, 0=no)
TI = The Table Indicator flag, used when IDT=0 (1=LDT, 0=GDT)
This same error-code format is used with exceptions 0x0B, 0x0C, and 0x0D

Our ‘simulation’ demo
• We can illustrate the ‘just-in-time’ idea by writing a program that performs a ‘far’ call to an ‘uninitialized’ region of memory:
      lcall $sel_CS, $draw_message
• The code-segment descriptor (referenced here by the selector-value ‘sel_CS’) will be initially marked ‘Not-Present’ (so this ‘lcall’ instruction will trigger an exception-11)

Our ‘fault-handler’
• Our Interrupt-Service-Routine for fault-11 will do two things:
    • Initialize the memory-region with code and data
    • Mark the code-segment’s descriptor as ‘Present’
• It will carefully preserve the CPU registers, so that it can ‘retry’ the faulting instruction

Where is the ‘error-code’?
Layout of our fault-handler’s stack (because we used a 286 interrupt-gate):

                  16-bits
    FLAGS           +6
    CS              +4
    IP              +2
    error-code      +0   <-- SS:SP

The Pentium provides a special pair of instructions that procedures can use to address any parameter-values that reside on its stack: ‘enter’ and ‘leave’

Code using ‘enter’ and ‘leave’

    isrNPF: # Our fault-handler for exception-0x0B
            enter   $0, $0                      # setup stackframe access
            call    initialize_the_high_arena
            call    mark_segment_as_ready
            leave                               # discard the frame access
            add     $2, %sp                     # discard the error-code
            iret                                # ‘retry’ the faulting instruction

What does ‘enter’ do?
• The effect of the single instruction
      enter   $0, $0
  is equivalent to this instruction-sequence:
      push    %bp
      mov     %sp, %bp

How the stack is changed

Layout of our fault-handler’s stack BEFORE executing ‘enter’:

                  16-bits
    FLAGS           +6
    CS              +4
    IP              +2
    error-code      +0   <-- SS:SP

Layout of our fault-handler’s stack AFTER executing ‘enter’:

                  16-bits
    FLAGS           +8
    CS              +6
    IP              +4
    error-code      +2
    old-BP          +0   <-- SS:BP

NOTE: Any memory-references that use indirect addressing via register BP will use the SS segment-register by default (not the DS segment-register), for example:
      testw   $0x0007, 2(%bp)

What does ‘leave’ do?
• The effect of the single instruction
      leave
  is equivalent to this instruction-sequence:
      mov     %bp, %sp
      pop     %bp

How the stack is changed

Layout of our fault-handler’s stack BEFORE executing ‘leave’:

                  16-bits
    FLAGS           +8
    CS              +6
    IP              +4
    error-code      +2
    old-BP          +0   <-- SS:BP
    …                    (other pushed words)
                         <-- SS:SP

Layout of our fault-handler’s stack AFTER executing ‘leave’:

                  16-bits
    FLAGS           +6
    CS              +4
    IP              +2
    error-code      +0   <-- SS:SP

So the effect of ‘leave’ is to undo the effect of ‘enter’

Our demo’s memory-layout

    0x00030000   ARENA #3 (not used by this demo)
    0x00020000   ARENA #2 (where our demo expects drawing code will reside)
    0x00010000   ARENA #1 (where the loader puts our program code and data)
    0x00007C00   BOOT_LOCN
    0x00000000

    Copy contents of ARENA #1 to ARENA #2

Efficient copying
• We use the Pentium’s ‘rep movsw’ instruction to perform memory-to-memory copying operations
• The segment-selector for the segment we copy from (it must be ‘readable’) goes into register DS, and the segment-selector for the segment we copy to (it must be ‘writable’) goes into register ES
• The number of words we will copy should match the size of our code-segment (which is 64KB, i.e., 0x8000 words)
• The Direction-Flag should be cleared (DF=0)

Example assembly code

            cld                         # use ‘forward’ string-copying

            mov     $sel_ds, %si        # selector for arena at 0x10000
            mov     %si, %ds            # goes in segment-register DS
            xor     %si, %si            # start copying from offset zero

            mov     $sel_DS, %di        # selector for arena at 0x20000
            mov     %di, %es            # goes in segment-register ES
            xor     %di, %di            # start copying to offset zero

            mov     $0x8000, %cx        # number of words to be copied
            rep     movsw               # perform the arena-copying

Segment-Descriptor Format

    bits 63..32:  Base[31..24] | G | D | RSV | AVL | Limit[19..16] | P | DPL | S | X | C/D | R/W | A | Base[23..16]
    bits 31..0:   Base[15..0] | Limit[15..0]

The segment-descriptor’s ‘Present’ bit is bit-number 47

In-class exercise
• To get some practical ‘hands on’ experience with implementing the demand-loading concept we suggest the following exercise: Modify our ‘notready.s’ demo so that it uses a 32-bit Interrupt-Gate for its Segment-Not-Present entry in the Interrupt Descriptor Table (this will affect the layout of the fault-handler’s stack)
• You may need to abandon use of the ‘enter’ and ‘leave’ instructions unless you also use a 32-bit data-segment descriptor for your stack-segment
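
A sketch in C (not part of the demo)
• The two bit-level facts a fault-11 handler relies on are the error-code fields (EXT, IDT, TI, table-index) and the descriptor’s P-bit (bit 47). The fragment below is only an illustration of those two layouts, written in C so it can be tried on any machine; it is not taken from ‘notready.s’. The names ‘fault_code’, ‘decode_error_code’, ‘is_present’ and ‘mark_present’ are hypothetical, and descriptors are assumed to be held as 64-bit integers in an ordinary array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of the error-code pushed for exceptions 0x0B, 0x0C and 0x0D.
   (Hypothetical helper type, for illustration only.) */
struct fault_code {
    unsigned ext;    /* bit 0: external event caused the exception   */
    unsigned idt;    /* bit 1: table-index refers to the IDT         */
    unsigned ti;     /* bit 2: used when idt == 0 (1 = LDT, 0 = GDT) */
    unsigned index;  /* bits 15..3: index into the descriptor table  */
};

/* Split a 16-bit error-code into its fields. */
static struct fault_code decode_error_code(uint16_t code)
{
    struct fault_code fc;
    fc.ext   = code & 1u;
    fc.idt   = (code >> 1) & 1u;
    fc.ti    = (code >> 2) & 1u;
    fc.index = (code >> 3) & 0x1FFFu;
    return fc;
}

/* The 'Present' bit is bit-number 47 of the 64-bit descriptor. */
#define DESC_PRESENT (1ULL << 47)

/* Return nonzero if the descriptor's P-bit is set. */
static int is_present(uint64_t descriptor)
{
    return (descriptor & DESC_PRESENT) != 0;
}

/* Set the P-bit of entry 'index' in a table holding 'count' descriptors. */
static void mark_present(uint64_t *table, size_t count, unsigned index)
{
    assert(index < count);
    table[index] |= DESC_PRESENT;
}
```

• For instance, an error-code of 0x0018 decodes to table-index 3 with TI=0, IDT=0 and EXT=0, so under these assumptions the handler would set bit 47 in entry 3 of the GDT before retrying the faulting instruction.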