Why can’t we do ‘raw’ I/O?
How the x86 stops user programs from directly controlling devices, and how we devise a ‘workaround’

x86 Privilege Levels
• To let multiple users perform multiple tasks in a manner that affords each some ‘protection’ against interference by the others, any modern CPU implements two or more separate levels of ‘privilege’ for its operations: an ‘unrestricted privileges’ arena for the code in its Master Control Program (its ‘kernel’), and a ‘restricted privileges’ realm for the code in users’ application programs

Four Privilege Rings
[Figure: the four privilege rings, from Ring 3 (least-trusted level) through Ring 2 and Ring 1 to Ring 0 (most-trusted level)]
Suggested purposes:
  Ring0: operating system kernel
  Ring1: operating system services
  Ring2: custom extensions
  Ring3: ordinary user applications
Unix/Linux and Windows:
  Ring0: operating system
  Ring1: unused
  Ring2: unused
  Ring3: application programs

IOPL
• The Intel x86 processor can either allow or prohibit access to system peripheral devices by code executing in the various ‘privilege rings’. It does so with a 2-bit field within the x86 FLAGS register that controls whether or not the ‘in’ and ‘out’ instructions are allowed to execute. This field is known as the I/O Privilege Level (IOPL) field, and Linux normally sets its value to zero

The x86 API registers
[Figure: the programmer-visible registers of the Intel Core-2 Quad processor in 64-bit mode: general-purpose registers RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP and R8 through R15; instruction pointer RIP; flags register RFLAGS; segment registers CS, DS, ES, FS, GS, SS]

The FLAGS register
  bit:   15  14  13-12  11  10   9   8   7   6   5   4   3   2   1   0
  flag:   0  NT  IOPL   OF  DF  IF  TF  SF  ZF   0  AF   0  PF   1  CF
Status flags: CF, PF, AF, ZF, SF, OF. The remaining named bits (TF, IF, DF, IOPL, NT) are control flags.
Legend: CF = Carry Flag, PF = Parity Flag, AF = Auxiliary Flag, ZF = Zero Flag, SF = Sign Flag, TF = Trap Flag, IF = Interrupt Flag, DF = Direction Flag, OF = Overflow Flag, IOPL = I/O Privilege Level, NT = Nested Task

‘seeflags.cpp’
• This demo-program lets us view the settings of the bits in the RFLAGS register, and the IOPL field (bits 13 and 12) in particular
• When IOPL == 0, only ring0 code is able to execute the ‘in’ and ‘out’ instructions (leaving aside the per-port I/O permission bitmap, which Linux’s ‘ioperm()’ manages)
• When IOPL == 3, code executing in any of the rings is able to execute I/O instructions
• So let’s change IOPL to 3. But how?

‘pushfq’/‘popfq’
• An idea suggested by the ‘inline’ assembly language in our ‘seeflags.cpp’ demo would be simply to ‘pop’ a suitably designed value from the stack into the RFLAGS register
• But the CPU will not allow that while it is executing ring3 code with IOPL set to 0, since it would compromise the system’s intended ‘protection’: a ‘popfq’ executed outside ring0 just leaves the IOPL field unchanged, without raising any exception

Must do it from ring0!
• Our classroom’s Linux systems allow us to install our own code module as an ‘add-on’ to the running kernel, and such code therefore executes without any restrictions (i.e., at ring0)
• This idea motivates us to explore briefly the programming ideas needed for writing our own LKM (Linux Kernel Module)

A module’s organization
[Figure: a module consists of its ‘payload’ function (my_info) plus the module’s two required administrative functions, module_init and module_exit]

Our ‘newproc.cpp’ utility
• For the type of LKM that creates a pseudo-file in the ‘/proc’ directory, there is a ‘skeleton’ of C-language code we can start from, adding our own specific functionality to that skeleton code
• You can quickly create this ‘skeleton’ file by using our ‘newproc.cpp’ utility program

Software interrupts
• One way for a user program, which normally executes in ring3, to switch to ring0 (if it’s allowed) is by using a ‘software interrupt’
• This is how the 32-bit version of Linux performed its various system calls, with ‘int $0x80’
• We can craft an LKM whose ‘payload’ is an interrupt service routine that changes the IOPL from 0 to 3

Systems programming
• To accomplish this design idea, we need an understanding of our CPU’s interrupt mechanism, including some special data structures located in kernel memory and some special CPU registers that allow the CPU to locate those data structures

Descriptor Tables
[Figure: special processor registers used by the CPU to locate its Descriptor Tables within the system’s memory: GDTR points to the Global Descriptor Table (GDT, segment descriptors) and IDTR points to the Interrupt Descriptor Table (IDT, 256 gate descriptors)]

IDT Descriptor-format (128 bits, shown as four 32-bit words)
  word 3: reserved (= 0)
  word 2: offset 63..32
  word 1: offset 31..16 (bits 31..16) | P (15) | DPL (14..13) | 0 (12) | gate type (11..8) | 00000 (7..3) | IST (2..0)
  word 0: segment selector (bits 31..16) | offset 15..0 (bits 15..0)
LEGEND:
  segment selector = selector for the handler’s code segment
  offset = offset within that code segment of the handler’s entry point
  gate type = 0xE for an Interrupt Gate, 0xF for a Trap Gate
  DPL = Descriptor Privilege Level
  IST = Interrupt Stack Table index (0..7)
  P = Present (1 = yes, 0 = no)

IDTR register-format (80 bits)
  bits 79..16: base address of the IDT (64 bits)
  bits 15..0: segment limit (16 bits)
• Special processor instructions are used to ‘load’ this 10-byte register from a memory image (‘LIDT’) or to ‘store’ this register’s value into memory (‘SIDT’)
• The ‘LIDT’ instruction can only be executed by code running in ring0, but ‘SIDT’ can be executed by code running at any privilege level (on processors of this generation; newer CPUs can restrict ‘SIDT’ to ring0 through the UMIP feature)

Stack layout after an interrupt (ring0 stack, 64-bit slots, just below the address given by RSP0)
  32(%rsp): SS
  24(%rsp): RSP
  16(%rsp): RFLAGS
   8(%rsp): CS
   0(%rsp): RIP

Our interrupt-9 handler
• Our ‘iokludge.c’ kernel module uses this ‘inline’ assembly language to generate the machine code for handling interrupt 9. The handler merely sets the IOPL field in the saved image of the RFLAGS register (at 16(%rsp)) to 3, and then resumes execution of the interrupted application program with ‘iretq’, which reloads RFLAGS from that saved image

    //-------------------- INTERRUPT SERVICE ROUTINE --------------------
    void isr_entry( void );
    asm(" .text                        ");
    asm(" .type  isr_entry, @function  ");
    asm("isr_entry:                    ");
    asm(" orq    $0x3000, 16(%rsp)     ");   // IOPL := 3 in the saved RFLAGS
    asm(" iretq                        ");   // return to the interrupted program
    //-------------------------------------------------------------------

Core-2 Quad system
[Figure: an Intel Core-2 Quad processor whose four CPUs (CPU 0 to CPU 3) share the system memory and the I/O devices over a common system bus]
• Each CPU has its own IDTR register, so a new IDT must be activated separately on every processor

‘smp_call_function()’
• This Linux kernel ‘helper’ routine allows a CPU to request all the other CPUs to execute a specified subroutine of type:
    void function( void *info );
  The requesting CPU is not included, so it must call the subroutine itself as well
• In our current Linux kernel (version 2.6.26.6) this helper routine takes four arguments:
  – the address of the subroutine’s entry point
  – the address of data the subroutine needs
  – a flag that indicates whether or not to ‘retry’
  – a flag that indicates whether or not to ‘wait’
• (Note: newer kernels omit the ‘retry’ argument)

Working with LKMs
• Create an LKM skeleton using ‘newproc’
• Compile a new LKM using ‘mmake’
• Install an LKM’s compiled ‘kernel object’ using the Linux ‘/sbin/insmod’ command
• Remove an LKM from the running kernel using the Linux ‘/sbin/rmmod’ command

‘iokludge.c’
module_init:
  1) Allocate a page of kernel memory (kpage), to be used as a new Interrupt Descriptor Table
  2) Save the original contents of system register IDTR, so it can be restored later
  3) Prepare a memory image for the new value of register IDTR, referring to kpage
  4) Set up pointers ‘oldidt’ and ‘newidt’, and copy the original IDT to our new page
  5) Set up a gate descriptor for isr_entry, with DPL = 3 so that ring3 code is permitted to execute ‘int $9’, to be installed as Gate 9 in our new IDT array
  6) Activate the new Interrupt Descriptor Table on all the processors in our system
  7) Return 0, to indicate a successful module installation
module_exit:
  1) Restore the original value of register IDTR in each of our system’s processors
  2) Free the page of kernel memory that was previously allocated for use as an IDT

‘tryiopl3.cpp’
• This demo-program is a modification of our earlier ‘seeflags.cpp’ example, but here we include the software-interrupt instruction ‘int $9’. If ‘iokludge.ko’ has been installed, this lets us check that the RFLAGS register’s IOPL has indeed been changed from 0 to 3, thereby permitting ‘in’ and ‘out’ to be executed!

Homework exercise
• Modify the ‘82573pci.cpp’ program that we weren’t able to execute, even with ‘sudo’, at our previous class meeting: replace its call to Linux’s ‘iopl()’ library function with the ‘inline’ assembly language statement for software interrupt 9, i.e.
    asm(" int $9 ");
• Then try again to compile and execute our ‘82573.cpp’ demo-program, only this time with our ‘iokludge.ko’ LKM installed
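The ‘seeflags.cpp’ and ‘tryiopl3.cpp’ demos both come down to reading RFLAGS and isolating bits 13 and 12. Their source is not reproduced in these slides, so what follows is only a minimal C sketch under our own assumptions: the helper names iopl_of(), read_rflags() and show_flags() are hypothetical, and reading the live register relies on the GCC/Clang builtin __builtin_ia32_readeflags_u64() (x86-64 only) rather than the demos’ own inline assembly.

```c
#include <stdint.h>
#include <stdio.h>

/* The IOPL field occupies bits 13:12 of (R)FLAGS. */
#define RFLAGS_IOPL_SHIFT 12
#define RFLAGS_IOPL_MASK  (3ULL << RFLAGS_IOPL_SHIFT)   /* = 0x3000 */

/* Extract the 2-bit I/O Privilege Level from an RFLAGS image. */
unsigned iopl_of(uint64_t rflags)
{
    return (unsigned)((rflags & RFLAGS_IOPL_MASK) >> RFLAGS_IOPL_SHIFT);
}

/* Read the live RFLAGS register (hypothetical helper).  On x86-64 with GCC
   or Clang this uses the compiler builtin, which performs the pushfq/pop
   sequence for us; on any other target it simply returns 0. */
uint64_t read_rflags(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
    return __builtin_ia32_readeflags_u64();
#else
    return 0;
#endif
}

/* Print RFLAGS, its IOPL field, and the names of the 1-bit flags that are set. */
void show_flags(uint64_t rflags)
{
    static const struct { unsigned bit; const char *name; } flags[] = {
        {0, "CF"}, {2, "PF"}, {4, "AF"}, {6, "ZF"}, {7, "SF"},
        {8, "TF"}, {9, "IF"}, {10, "DF"}, {11, "OF"}, {14, "NT"},
    };
    printf("RFLAGS=%016llx  IOPL=%u  set:",
           (unsigned long long)rflags, iopl_of(rflags));
    for (size_t i = 0; i < sizeof flags / sizeof flags[0]; i++)
        if (rflags & (1ULL << flags[i].bit))
            printf(" %s", flags[i].name);
    putchar('\n');
}
```

To mimic the homework check, a program could call show_flags(read_rflags()), execute asm(" int $9 "), and call it again: with iokludge.ko installed the second line should report IOPL=3, whereas without the module the CPU is expected to refuse the ‘int $9’ and the program to be terminated by a signal.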
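Step 5 of ‘iokludge.c’ module_init packs the fields from the ‘IDT Descriptor-format’ slide into one 16-byte gate descriptor, and step 3 prepares the 10-byte IDTR memory image. The module’s actual source is not shown here, so this is only a user-space sketch of that bit-packing: gate_desc_t, make_gate() and struct idtr_image are hypothetical names, and the selector and offset values used to exercise it are arbitrary examples.

```c
#include <assert.h>
#include <stdint.h>

/* One 16-byte x86-64 IDT gate descriptor, viewed as four 32-bit words
   (word 0 at the lowest address).  Hypothetical type name. */
typedef struct { uint32_t word[4]; } gate_desc_t;

/* Build a present gate descriptor from its fields:
   word 0: selector (31..16) | offset 15..0
   word 1: offset 31..16 | P | DPL | 0 | type | 00000 | IST
   word 2: offset 63..32
   word 3: reserved (0) */
gate_desc_t make_gate(uint64_t offset, uint16_t selector,
                      unsigned type, unsigned dpl, unsigned ist)
{
    assert(type <= 0xF && dpl <= 3 && ist <= 7);
    gate_desc_t g;
    g.word[0] = ((uint32_t)selector << 16) | (uint32_t)(offset & 0xFFFF);
    g.word[1] = (uint32_t)(offset & 0xFFFF0000u)  /* offset 31..16 stays in place */
              | (1u << 15)                        /* P = 1: present */
              | (dpl << 13)                       /* DPL 3 lets ring3 use 'int $n' */
              | (type << 8)                       /* 0xE interrupt gate, 0xF trap gate */
              | ist;                              /* bits 7..3 remain 0 */
    g.word[2] = (uint32_t)(offset >> 32);
    g.word[3] = 0;                                /* reserved */
    return g;
}

/* The 10-byte memory image that LIDT loads and SIDT stores
   (packed with a GCC/Clang attribute so no padding follows 'limit'). */
struct idtr_image {
    uint16_t limit;   /* size of the IDT in bytes, minus 1 */
    uint64_t base;    /* linear address of the IDT */
} __attribute__((packed));
```

Inside a module one might call something like make_gate((uintptr_t)isr_entry, __KERNEL_CS, 0xE, 3, 0) and store the result as entry 9 of the new table; a full table of 256 such gates needs an IDTR limit of 256*16 - 1 = 4095.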