Multiprocessor Initialization
An introduction to the use of Interprocessor Interrupts

Multiprocessor topology

[Diagram; labels: Back Side Bus, Local APIC (CPU #0), Local APIC (CPU #1), IO APIC, Front Side Bus, bridge, system memory, peripheral devices]

The Local-APIC ID register

    bits 31..24   APIC ID
    bits 23..0    reserved

This register is initially zero, but its APIC ID field (8 bits) is programmed by the BIOS during system startup with a unique processor identification number, which is subsequently used when specifying the processor as a recipient of inter-processor interrupts.
Memory-Mapped Register-Address: 0xFEE00020

The Local-APIC EOI register

    bits 31..0    write-only register

This write-only register is used by Interrupt Service Routines to issue an ‘End-Of-Interrupt’ command to the Local-APIC. Any value written to this register will be interpreted by the Local-APIC as an EOI command. The value stored in this register is initially zero (and it will remain unchanged).
Memory-Mapped Register-Address: 0xFEE000B0

The Spurious Interrupt register

    bits 31..9    reserved
    bit  8        EN: Local-APIC is Enabled (1=yes, 0=no)
    bits 7..0     spurious vector

This register is used to enable or disable the functioning of the Local-APIC and, when it is enabled, to specify the interrupt-vector number to be delivered to the processor in case the Local-APIC generates a ‘spurious’ interrupt. (In some processor-models, the vector’s lowest 4 bits are hardwired 1s.)
Memory-Mapped Register-Address: 0xFEE000F0

Interrupt Command Register

• Each Pentium’s Local-APIC has a 64-bit Interrupt Command Register (ICR)
• It can be programmed by system software to transmit messages (via the Back Side Bus) to one or several other processors
• Each processor has a unique identification number in its Local-APIC ID register that can be used for directing messages to it

ICR (upper 32-bits)

    bits 31..24   Destination field
    bits 23..0    reserved

The Destination Field (8 bits) can be used to specify which processor (or group of processors) will receive the message.
Memory-Mapped Register-Address: 0xFEE00310

ICR (lower 32-bits)

    bits 19..18   Destination Shorthand    00 = no shorthand
                                           01 = only to self
                                           10 = all including self
                                           11 = all excluding self
    bit  15       Trigger Mode             0 = Edge, 1 = Level
    bit  14       Level                    0 = De-assert, 1 = Assert
    bit  12       Delivery Status (R/O)    0 = Idle, 1 = Pending
    bit  11       Destination Mode         0 = Physical, 1 = Logical
    bits 10..8    Delivery Mode            000 = Fixed
                                           001 = Lowest Priority
                                           010 = SMI
                                           011 = (reserved)
                                           100 = NMI
                                           101 = INIT
                                           110 = Start Up
                                           111 = (reserved)
    bits 7..0     Vector field

Memory-Mapped Register-Address: 0xFEE00300

MP initialization protocol

• Set shared processor-counter equal to 1
• Step 1: issue an ‘INIT’ IPI to all-except-self
• Delay for 10 milliseconds
• Step 2: issue ‘Startup’ IPI to all-except-self
• Delay for 200 microseconds
• Step 3: issue ‘Startup’ IPI to all-except-self
• Delay for 200 microseconds
• Check the value of the processor-counter

Issue ‘INIT’ IPI

        # address Local-APIC via register FS
        mov     $sel_fs, %ax
        mov     %ax, %fs

        # broadcast ‘INIT’ IPI to ‘all-except-self’
        mov     $0x000C4500, %eax
        mov     %eax, %fs:(0xFEE00300)
    .B0:
        btl     $12, %fs:(0xFEE00300)   # Delivery Status still Pending?
        jc      .B0                     # yes, keep polling

Issue ‘Startup’ IPI

        # broadcast ‘Startup’ IPI to all-except-self
        # using vector 0x11 to specify entry-point
        # at real memory-address 0x00011000
        mov     $0x000C4611, %eax
        mov     %eax, %fs:(0xFEE00300)
    .B1:
        btl     $12, %fs:(0xFEE00300)   # Delivery Status still Pending?
        jc      .B1                     # yes, keep polling

Timing delays

• Intel’s MP Initialization Protocol specifies the use of some timing-delays:
  – 10 milliseconds ( = 10,000 microseconds)
  – 200 microseconds
• We can use the 8254 Timer’s Channel 2 for implementing these timed delays, by programming it for ‘one-shot’ countdown mode, then polling bit #5 at i/o port 0x61

Mathematical examples

EXAMPLE 1
Delaying for 10 milliseconds means delaying for 1/100-th of a second (because 100 times 10 milliseconds = one-thousand milliseconds)

EXAMPLE 2
Delaying for 200 microseconds means delaying for 1/5000-th of a second (because 5000 times 200 microseconds = one-million microseconds)

GENERAL PRINCIPLE
Delaying for x microseconds means delaying for 1/(1000000/x)-th of a second (because 1000000/x times x microseconds = one-million microseconds)

Mathematical theory

PROBLEM: Given the desired delay-time in microseconds, express the desired delay-time in clock-frequency pulses and program that number into the PIT’s Latch-Register

RECALL: Clock-Frequency = 1193182 Hertz (pulses per second)
ALSO: One second equals one-million microseconds

APPLYING DIMENSIONAL ANALYSIS
    Pulses-Per-Microsecond = Pulses-Per-Second / Microseconds-Per-Second
    Delay-in-Clock-Pulses = Delay-in-Microseconds * Pulses-Per-Microsecond

CONCLUSION
For a desired time-delay of x microseconds, the number of clock-pulses may be computed as
    x * (1193182 / 1000000) = 1193182 / (1000000 / x)
as dividing by a fraction amounts to multiplying by that fraction’s reciprocal

Delaying for EAX microseconds

        # We use the 8254 Timer/Counter Channel 2 to generate a
        # timed delay (expressed in microseconds by value in EAX)
        # (EAX must be nonzero and at most 1000000, or a division by zero results)
        mov     %eax, %ecx          # copy delay-time to ECX
        mov     $1000000, %eax      # microseconds-per-sec
        xor     %edx, %edx          # extended to quadword
        div     %ecx                # perform dword division
        mov     %eax, %ecx          # copy quotient into ECX
        mov     $1193182, %eax      # input-pulses-per-sec
        xor     %edx, %edx          # extended to quadword
        div     %ecx                # perform dword division
        # now transfer the quotient from AX to the Channel 2 Latch

Mutual Exclusion

• Shared variables must not be modified by more than one processor at a time (‘mutual exclusion’)
• The Pentium’s ‘lock’ prefix helps enforce this
• Example: every processor adds 1 to count

        lock    incl    (count)

• Example: all processors need private stacks

        mov     $0x1000, %ax        # size of one stack-segment, in paragraphs
        lock    xadd    %ax, (new_SS)   # AX = old new_SS, new_SS += 0x1000
        mov     %ax, %ss            # this CPU's private stack-segment

ROM-BIOS isn’t ‘reentrant’

• The video service-functions in ROM-BIOS that we use to display a message-string at the current cursor-location (and afterward advance the cursor) modify global storage locations (as well as i/o ports), and hence must be called by one processor at a time
• A shared memory-variable (called ‘mutex’) is used to enforce this mutual exclusion

Implementing a ‘spinlock’

    mutex:  .word   1               # 1 = free, 0 = held

    spin:
        btw     $0, mutex           # is the mutex free?
        jnc     spin                # no, keep waiting
        lock    btrw    $0, mutex   # try to grab it atomically
        jnc     spin                # another CPU got it first

        # <CRITICAL SECTION OF CODE GOES HERE>

        lock    btsw    $0, mutex   # release the mutex

Demo: ‘smphello.s’

• Each CPU needs to access its Local-APIC
• The BSP (“Boot-Strap Processor”) wakes up other processors by broadcasting the ‘INIT-SIPI-SIPI’ message-sequence
• Each AP (“Application Processor”) starts executing at a 4K page-boundary, and needs its own private stack-area
• Shared variables need ‘exclusive’ access

In-class exercise

• Include this procedure that multiple CPUs will execute simultaneously (without ‘lock’)

    total:  .word   0               # the shared variable

    add_one_thousand:
        mov     $1000, %cx
    nxinc:
        addw    $1, (total)
        loop    nxinc
        ret

We may need a ‘barrier’

• We can use a software construct (known as a ‘barrier’) to stop CPUs from entering a block of code until a prescribed number of them are all ready to enter it together

    arrived: .word  0               # shared variable

    barrier:
        lock    incw    (arrived)
    await:
        cmpw    $2, (arrived)
        jb      await
        call    add_one_thousand
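
Cross-check: the ICR command values

The two values written to the ICR in the broadcast code earlier in these slides (0x000C4500 for ‘INIT’, 0x000C4611 for ‘Startup’) can be rebuilt from the field layout of the ICR’s lower 32 bits. What follows is only a minimal sketch in C, meant as a way to check those constants by hand; the names icr_low, sipi_entry_address and the ICR_* constants are hypothetical helpers invented here, not part of any APIC programming interface.

```c
#include <assert.h>
#include <stdint.h>

/* Field values for the ICR's lower 32 bits (illustrative names). */
enum {
    ICR_SHORTHAND_ALL_EXCL_SELF = 3,  /* bits 19..18 = 11  */
    ICR_LEVEL_ASSERT            = 1,  /* bit  14     = 1   */
    ICR_MODE_INIT               = 5,  /* bits 10..8  = 101 */
    ICR_MODE_STARTUP            = 6   /* bits 10..8  = 110 */
};

/* Pack shorthand, level, delivery mode and vector into one command word.
   Trigger Mode (bit 15) and Destination Mode (bit 11) are left at 0. */
uint32_t icr_low(uint32_t shorthand, uint32_t level,
                 uint32_t mode, uint32_t vector)
{
    assert(shorthand <= 3 && level <= 1 && mode <= 7 && vector <= 0xFF);
    return (shorthand << 18) | (level << 14) | (mode << 8) | vector;
}

/* A 'Startup' vector names a 4K page: entry-point = vector * 0x1000. */
uint32_t sipi_entry_address(uint32_t vector)
{
    assert(vector <= 0xFF);
    return vector << 12;
}
```

Under these assumptions, icr_low(ICR_SHORTHAND_ALL_EXCL_SELF, ICR_LEVEL_ASSERT, ICR_MODE_INIT, 0x00) yields 0x000C4500, the same call with ICR_MODE_STARTUP and vector 0x11 yields 0x000C4611, and sipi_entry_address(0x11) yields 0x00011000.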
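
Cross-check: the timer counts

The pulse-count formula from the ‘Mathematical theory’ slide can also be checked numerically. The sketch below, again in C and under the same caveat, mirrors the two successive integer divisions of the ‘Delaying for EAX microseconds’ routine and compares them with a multiply-first variant. Both function names are hypothetical, and the multiply-first form is an alternative introduced here for comparison, not something the slides prescribe.

```c
#include <assert.h>
#include <stdint.h>

#define PIT_HZ       1193182u   /* 8254 input pulses per second */
#define USEC_PER_SEC 1000000u   /* microseconds per second      */

/* Mirrors the assembly routine's two divisions:
   count = PIT_HZ / (USEC_PER_SEC / x), with integer truncation twice. */
uint32_t pit_count_two_divisions(uint32_t delay_us)
{
    /* outside this range the inner quotient is 0 or undefined */
    assert(delay_us > 0 && delay_us <= USEC_PER_SEC);
    return PIT_HZ / (USEC_PER_SEC / delay_us);
}

/* Alternative: count = x * PIT_HZ / USEC_PER_SEC, using a 64-bit
   product so the multiplication cannot overflow; truncates only once. */
uint32_t pit_count_multiply_first(uint32_t delay_us)
{
    return (uint32_t)(((uint64_t)delay_us * PIT_HZ) / USEC_PER_SEC);
}
```

For the two delays the protocol needs, both forms agree: 11931 pulses for 10000 microseconds and 238 pulses for 200 microseconds. They diverge when x does not divide one million evenly; for 30000 microseconds the two-division form gives 36157 while the multiply-first form gives 35795, because 1000000/30000 is truncated to 33 before the second division. Either way, the result has to fit the 16-bit Channel 2 latch.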