CV - queue of waiters; ALWAYS hold lock & re-check condition in a while loop
Cache coherence - Modified-Shared-Invalid (MSI) protocol
- M = write to cache -> must invalidate peer caches - overhead! test&set is also a write to shared state
Logging/Reliability
Hardware model: 1. pending writes may be reordered, 2. sector writes are atomic
Crash Fault - consistency: want each operation either complete or absent
MCS Lock (solution to test&set MSI traffic on multiprocessors)
- Thread-specific struct: Next + Wait fields
- eliminates contention on the shared lock (each thread spins on its own thread-local wait flag)
RW Lock: wait for all readers before a write
- sleeplock + condition variable; one writer OR multiple readers
RCU (Read-Copy-Update) - eventual consistency (after grace period ends)
- concurrent readers during update; single writer publishes before grace period ends
- Alternative: Ordering + FSCK - check every file on the system
Crash-safe write order: 1. alloc data blk 2. write data 3. alloc inode 4. write inode 5. update FI/FD bitmaps 6. update dir
Scheduling:
- response (arrival -> first scheduled), turnaround (arrival -> completion)
Assumptions: A1 same runtime, A2 same arrival, A3 run to completion, A4 no I/O, A5 known runtime
Policies (optimal under): FIFO - A1, SJF - A2+A3, STCF - A4 (turnaround), RR - A4 (fair + response)
MLFQ - priority RR queues w/ increasing quantums (interval) as prio drops
- if task exceeds quantum, prio ↓; yields before quantum, prio =
- prio inversion: high prio depends on low prio -> blocked -> solve by bumping low prio to the blocked proc's prio
Lottery: assign # tickets based on prio (higher prio -> better chance) -> fair
MLFQ multiprocessor - lock + write cache contention, affinity
- per-core MLFQ avoids contention + idle cores steal work
Device I/O
MMIO: R+W to device address regions (start + len) from within OS
Fail-Stop Fault (disk completely fails)
- RAID 0: data striped RR across disks -> no redundancy
- RAID 1: complete duplicate of data -> wastes capacity
- RAID 10: groups of mirrored disks, striped -> hybrid of 0+1
- RAID 5: RR (rotating) parity; parity = XOR of data blocks (0 XOR 1 = P=1, P XOR 1 = 0 recovers the lost block)
Inode - metadata (permissions, data locations); file = (inode #, seq<bytes>/data blocks)
Directory - (Inode #, Map<file name,
Inode #>)
Pathname/Link - Hard: points to inode; Soft: points to a path
Mount - named mapping to another file system; Mount point - target DIR that holds the mounted file system
File Descriptor Unix API - read/write (fd -> buf), open -> returns fd
lseek - change current position in file for read/write
fsync - flush modifications (in block cache) to disk
File Systems:
VSFS: superblock (type=VSFS), data blocks, inode table, free bitmaps (FI, FD) (point to free space); + small files, - large files
FAT: linked list of entry + data block (any size); + large files, - random access (O(N))
FFS: inode array w/ direct + indirect pointers
Bus Controller: manages peripheral bus, translates higher-lvl ops (MMIO) into device-specific wire operations (e.g. I2C)
Polling: busy-wait for I/O; Interrupt: RAM L/S -> CPU -> device, concurrent execution
DMA: CPU offloads L/S; device transfers directly between RAM and device memory
File I/O: ordered hardware -> app (low to high)
HDD (Disk) - avg random read = seek + 0.5*rotation + transfer
- seek: time to move arm to track; rotation: time for one disk rotation; transfer: time to move data to/from disk
Disk Driver - presents block interface for HDD + descriptors
Block/Buf Cache - maintains block data in memory
Indexed inodes (FFS-style): + large files, + random seek; - small files (need inode + data block -> internal fragmentation), - contiguous data (still need a pointer per data block)
Solutions:
- NTFS/EXT4 - contiguous -> extent (ptr, length) = only need 1 pointer per run
- small file -> resident data (store file contents in the inode itself)
Cylinder group (mini FS): same-DIR files (ideally), bitmap, inodes -> same tracks
Log-Structured File System (LFS): store files in a log
- like a FAT linked list, plus a dictionary (inode map) to support random access
- closes gap between random/seq I/O + uses full bandwidth of disk (log IS the data, no separate copy)
Power Management
DVFS controls P-States (efficiency when core is utilized):
- Performance (MAX freq), Powersave (MIN freq), Schedutil (match specific util level), Ondemand (dynamic ↑ or ↓)
System PM States (tradeoff: power consumption vs wakeup latency):
- Sleep
(keep RAM on); Hibernate (power off + write RAM to disk)
Drowsy Power Management: wake up only what's needed (min wake set)
Virtual machines: guest OS sees VM as a physical machine
Containers: isolation between container + other software (more lightweight -> only apps + dependencies/libraries)
Security
Side-Channel Attacks: exploit the system's implementation, not its interface
- examples: cache, timing, speculative execution
Trusted Computing Base (TCB): min set that must be trusted
- hardware, OS, assembler, compiler, etc.
Authentication: one-way (hash) functions (SHA-256)
- only store hashed + salted passwords
Authorization:
- Reference Monitor: determines if an action is allowed
- Access Control List: per-file table of permissions for each user
- Capabilities: specific tokens grant diff access to files; + easy to count tokens, derive for children; - hard to revoke/delete
Kernel Page Table Isolation: user page table contains only entry points to kernel (trampoline) -> prevents Meltdown cache attack
WARD paper:
- fine-grained separation of public vs private (user vs kernel)
- clear software/hardware contract
Architectures
Monolithic Kernel: entire OS in kernel, big syscall interface
Microkernel: min set of reqs to support an OS, <20ish syscalls
- add "plug-ins" to supplement every other feature
- IPC: send/receive/call/reply -> key syscalls for communication between external plug-ins and OS
- + security (more isolation); - performance, - usability
Labs
Trap&Emulate: traps jump to hypervisor
- App -> hypervisor (ecall) -> guest OS -> hypervisor (eret) -> App
Hypervisor Extension: splits control of privileges in the guest OS
- regular syscalls -> no hypervisor exit needed (virtualized)
- others (translate/map VA): hyp-ext supervisor handles
2-level Address Translation (creates illusion of physical memory for guest OS):
- Guest virtual -> [Guest PT] -> guest physical -> [Host PT] -> host physical
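The CV rule at the top of the notes (ALWAYS hold the lock, re-check the condition in a while loop) can be sketched with POSIX threads; the `produce`/`consume` names and the counter are illustrative, not from the course:

```c
#include <pthread.h>

// Shared state protected by `mu`; `cv` signals "count changed".
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int count = 0; // hypothetical shared resource count

void produce(void) {
    pthread_mutex_lock(&mu);
    count++;
    pthread_cond_signal(&cv); // wake one thread in the CV's queue of waiters
    pthread_mutex_unlock(&mu);
}

int consume(void) {
    pthread_mutex_lock(&mu);            // ALWAYS hold the lock around wait
    while (count == 0)                  // while, not if: re-check after wakeup
        pthread_cond_wait(&cv, &mu);    // atomically releases mu, re-acquires on wake
    int got = --count;
    pthread_mutex_unlock(&mu);
    return got;
}
```

The while loop matters because `pthread_cond_wait` permits spurious wakeups and another waiter may consume the condition first.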
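The MCS lock's per-thread struct with Next + Wait fields can be sketched with C11 atomics (field and function names are my own, a minimal version of the classic MCS queue lock):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdbool.h>

// Per-thread queue node: `next` links to the successor, `wait` is the
// thread-local flag each waiter spins on (no spinning on shared state).
struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool wait;
};

typedef _Atomic(struct mcs_node *) mcs_lock; // tail of waiter queue; NULL = free

void mcs_acquire(mcs_lock *lk, struct mcs_node *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->wait, true);
    // Swap ourselves in as the new tail; one write to shared state total.
    struct mcs_node *pred = atomic_exchange(lk, me);
    if (pred) {
        atomic_store(&pred->next, me);   // link in behind the predecessor
        while (atomic_load(&me->wait))   // spin on OUR flag only (cache-local)
            ;
    }
}

void mcs_release(mcs_lock *lk, struct mcs_node *me) {
    struct mcs_node *succ = atomic_load(&me->next);
    if (!succ) {
        struct mcs_node *expected = me;
        // No visible successor: if we are still the tail, the lock is free.
        if (atomic_compare_exchange_strong(lk, &expected, NULL))
            return;
        while (!(succ = atomic_load(&me->next))) // successor is mid-enqueue
            ;
    }
    atomic_store(&succ->wait, false); // hand the lock directly to the successor
}
```

Each waiter spins on its own node, so releasing the lock invalidates only the successor's cache line, avoiding the MSI invalidation storm that test&set causes.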
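The RAID 5 parity relation in the notes (P = XOR of the data blocks; XOR-ing P with the survivors recovers a lost block) in toy form, with one byte standing in for each disk's block (sizes and names are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

// Parity over n data bytes: P = d0 ^ d1 ^ ... ^ d(n-1).
uint8_t raid5_parity(const uint8_t *data, size_t n) {
    uint8_t p = 0;
    for (size_t i = 0; i < n; i++)
        p ^= data[i];
    return p;
}

// Recover the failed disk's byte: XOR parity with all surviving bytes.
// Works because x ^ x = 0 cancels every disk except the missing one.
uint8_t raid5_recover(const uint8_t *surviving, size_t n, uint8_t parity) {
    uint8_t x = parity;
    for (size_t i = 0; i < n; i++)
        x ^= surviving[i];
    return x;
}
```

This also reproduces the notes' single-bit example: parity of 0 and 1 is P = 1, and P XOR 1 gives back 0.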
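The 2-level translation chain (guest virtual -> guest PT -> guest physical -> host PT -> host physical) as toy arithmetic; real hardware walks multi-level tables (e.g. RISC-V Sv39), but here each page table is a flat array and the 4 KiB page size is an assumption for illustration:

```c
#include <stdint.h>

#define PGSIZE 4096u  // toy page size
#define NPAGES 16u    // toy address-space size in pages

// guest_pt: guest virtual page number -> guest physical page number
// host_pt:  guest physical page number -> host physical page number
uint32_t translate2(const uint32_t guest_pt[NPAGES],
                    const uint32_t host_pt[NPAGES],
                    uint32_t guest_va) {
    uint32_t off  = guest_va % PGSIZE;  // page offset passes through unchanged
    uint32_t gvpn = guest_va / PGSIZE;  // guest virtual page number
    uint32_t gppn = guest_pt[gvpn];     // level 1: guest PT gives guest physical
    uint32_t hppn = host_pt[gppn];      // level 2: host PT gives host physical
    return hppn * PGSIZE + off;         // final host physical address
}
```

The guest OS only ever sees guest physical addresses, so it believes it manages real memory; the host PT applies the second translation transparently.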