Chapter 8 Interfacing Processors and Peripherals

Performance Analysis of Synchronous vs. Asynchronous Buses
• Compare the maximum bandwidth for a synchronous and an asynchronous bus:
– synchronous bus: clock cycle = 50ns, each bus transmission takes 1 clock cycle
– asynchronous bus: 40ns per handshake
– bus width = 32 bits
• Find the bandwidth for each bus when performing one-word reads from a 200ns memory.

Synchronous Bus
• Send the address to memory: 50ns
• Read the memory: 200ns
• Send the data to the device: 50ns
• Total time = 300ns
• Bandwidth = 4 bytes/300ns = 13.3 MB/s

Asynchronous Bus
• Step 1: 40ns
• Steps 2, 3, 4: max(3 x 40ns, 200ns) = 200ns (steps 2, 3, 4 can be overlapped with the memory access)
• Steps 5, 6, 7: 3 x 40ns = 120ns
• Total = 360ns
• Bandwidth = 4 bytes/360ns = 11.1 MB/s

Performance Analysis of Two Bus Schemes
• Given a system with
– a memory and bus system supporting block access of 4 to 16 words
– a 64-bit synchronous bus clocked at 200MHz, with each 64-bit transfer taking 1 clock cycle, and 1 clock cycle to send an address to memory
– two clock cycles needed between each bus operation
– memory access for the first 4 words takes 200ns; each additional set of 4 words requires 20ns

Question
• Find the sustained bandwidth and latency for a read of 256 words for transfers using 4-word blocks and 16-word blocks.
• Find the effective number of bus transactions per second for each case.

4-Word Block Transfer
• 1 clock cycle to send the address to memory
• 200ns/(5ns/cycle) = 40 cycles to read memory
• 2 cycles to send the data from memory (4 words of 4 bytes each over a 64-bit bus)
• 2 idle cycles
• Total = 45 cycles per transaction
• 256 words require 256/4 = 64 transactions, so 45 x 64 = 2880 cycles
• Latency = 2880 cycles x 5ns/cycle = 14400ns
• Number of bus transactions = 64 x 1s/14400ns = 4.44M transactions/s
• Bandwidth = (256 x 4 bytes) x 1/14400ns = 71.11 MB/s

16-Word Block Transfer
• 1 clock cycle to send the address to memory
• 40 cycles to read the first 4 words from memory
• 2 cycles to send data, during which the read of the next 4 words is started.
• 2 idle cycles between transfers, during which the read of the next 4 words is completed.
• The last two steps are performed 4 times in total (repeated 3 more times) to read all 16 words.
• Total cycles required = 1 + 40 + 4 x (2 + 2) = 57 cycles
• 256/16 = 16 transactions are required
• Total cycles required for 256 words = 16 x 57 = 912 cycles; latency = 912 x 5ns = 4560ns
• Number of bus transactions = 16 x 1s/4560ns = 3.51M transactions/s
• Bandwidth = (256 x 4 bytes) x 1/4560ns = 224.56 MB/s

Bus Standards
• PCI (a general-purpose backplane bus)
• SCSI (Small Computer System Interface)
• IEEE 1394 (FireWire)
• USB 2.0

Interfacing I/O Devices
• How is a user I/O request transformed into a device command and communicated to the device?
• How is data actually transferred to or from a memory location?
• What is the role of the operating system?

Role of the OS
• The OS plays a major role in handling I/O because:
– the I/O system is shared by multiple programs using the processor
– the I/O system often uses interrupts, which cause a transfer to supervisor mode
– low-level control of I/O is complex

Communications between OS and I/O Devices
• The OS must be able to give commands to I/O devices.
• An I/O device must be able to notify the OS when an operation has completed or an error has occurred.
• Data must be transferred between memory and an I/O device.

Giving Commands to I/O
• To give a command, the processor must be able to address the device and supply command words:
– memory-mapped I/O: portions of the address space are assigned to I/O devices
– special I/O instructions: dedicated I/O instructions in the processor

Communicating with the Processor
• Polling: the processor periodically checks the status of the I/O device.
• Overhead of polling in an I/O system
– Example 1: mouse
– Example 2: floppy disk
– Example 3: hard disk

Mouse
• Assume a polling operation, including transferring to the polling routine, accessing the device, and restarting the user program, takes 400 clock cycles on a processor with a 500 MHz clock.
• The mouse must be polled 30 times a second to ensure that no user movement is missed.
• Fraction of CPU time = 30 x 400/(500 x 10^6) = 0.0024%, or about 0.002%

Floppy Disk
• The floppy disk transfers data to the processor in 16-bit units and has a data rate of 50KB/s.
• Polling rate = (50KB/s)/(2 bytes/poll) = 25K polls/s
• Fraction of CPU time = 25K x 400/(500 x 10^6) = 2%

Hard Disk
• Data is transferred in 4-word blocks (16 bytes per poll)
• Transfer rate: 4MB/s
• Polling rate = (4MB/s)/(4 x 4 bytes/poll) = 250K polls/s
• Fraction of CPU time = 250K x 400/(500 x 10^6) = 20%

Overhead of Polling
• Polling can be restricted to the periods when the device is active, which reduces the overhead.
• However, the overhead is still significant, which motivates another design called interrupt-driven I/O.

Overhead of Interrupt-Driven I/O
• Assume the overhead for each transfer, including the interrupt, is 500 cycles.
• Cycles per second for the disk = 250K x 500 = 125 x 10^6 cycles
• Fraction of processor consumed = 125 x 10^6/(500 x 10^6) = 25%
• Assuming the disk is transferring data only 5% of the time, the average fraction of CPU time = 25% x 5% = 1.25%

Direct Memory Access (DMA)
• If the disk is transferring data most of the time, the overhead of interrupt-driven I/O is still high.
• For a high-bandwidth device, let the device controller transfer data directly to or from memory without involving the processor. This is known as direct memory access.
• An interrupt is used to signal the completion of an I/O transfer or an error.

Overhead of I/O Using DMA
• Assume the initial setup of a DMA transfer takes 1000 cycles, handling the interrupt at DMA completion takes 500 cycles, and the average transfer from disk is 8KB.
• Each DMA transfer takes 8KB/(4MB/s) = 2 x 10^-3 s
• If the disk is constantly transferring data, it requires (1000 + 500)/(2 x 10^-3) = 750 x 10^3 cycles per second
• Fraction of CPU time = 750 x 10^3/(500 x 10^6) = 0.15%

I/O System Design
• Latency constraints: ensuring that the latency to complete an I/O operation is bounded.
• Bandwidth constraints
• Performance analysis techniques:
– queuing theory
– simulation
– analysis

I/O System Design: Example
• CPU: 300 MIPS, with an average of 5000 OS instructions per I/O operation
• Backplane bus transfer rate: 100 MB/s
• SCSI-2 controllers with a transfer rate of 20 MB/s, each accommodating up to 7 disks
• Disk bandwidth = 5MB/s, seek + rotational latency = 10ms
• Workload: 64-KB reads; the user program needs 100,000 instructions per I/O operation
• Find
– the maximum sustainable I/O rate
– the number of disks and SCSI controllers required.
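
The slide leaves this question open. As a hedged sketch of one way to work it, the Python below treats the CPU and the backplane bus as independent limits, takes the smaller I/O rate as the sustainable one, and then sizes the disks and controllers to match. It assumes decimal units (1 KB = 1000 bytes, 1 MB = 10^6 bytes), that every read pays the full 10ms seek and rotational latency, and that each disk loads its controller only at its average per-read rate. The function name `design_io_system` and its parameter names are illustrative and do not come from the slides.

```python
import math


def design_io_system(
    cpu_ips=300e6,              # 300 MIPS
    os_instr_per_io=5_000,      # OS instructions per I/O operation
    user_instr_per_io=100_000,  # user-program instructions per I/O operation
    bus_bytes_per_s=100e6,      # backplane bus: 100 MB/s
    scsi_bytes_per_s=20e6,      # SCSI-2 controller: 20 MB/s
    max_disks_per_scsi=7,
    disk_bytes_per_s=5e6,       # disk bandwidth: 5 MB/s
    disk_latency_s=10e-3,       # seek + rotational latency: 10 ms
    io_bytes=64e3,              # 64-KB read (assumes 1 KB = 1000 bytes)
):
    # I/O rate each shared resource could sustain on its own.
    cpu_rate = cpu_ips / (os_instr_per_io + user_instr_per_io)
    bus_rate = bus_bytes_per_s / io_bytes
    max_rate = min(cpu_rate, bus_rate)
    bottleneck = "cpu" if cpu_rate < bus_rate else "bus"

    # Time one disk needs per read: latency plus transfer time.
    disk_time_s = disk_latency_s + io_bytes / disk_bytes_per_s
    disks = math.ceil(max_rate * disk_time_s)

    # Average bandwidth one disk presents to its controller, and how many
    # such disks fit under the controller's transfer rate and port limit.
    per_disk_bytes_per_s = io_bytes / disk_time_s
    by_bandwidth = int(scsi_bytes_per_s // per_disk_bytes_per_s)
    disks_per_controller = min(max_disks_per_scsi, by_bandwidth)
    controllers = math.ceil(disks / disks_per_controller)

    return {
        "cpu_rate": cpu_rate,
        "bus_rate": bus_rate,
        "max_io_rate": max_rate,
        "bottleneck": bottleneck,
        "disks": disks,
        "controllers": controllers,
    }


if __name__ == "__main__":
    for name, value in design_io_system().items():
        print(f"{name}: {value}")
```

Under these assumptions the bus is the bottleneck at 1562.5 I/O operations per second (the CPU alone could sustain about 2857), each disk needs 22.8ms per read, and the sketch arrives at 36 disks on 6 controllers. Changing the unit convention or the controller-loading assumption may shift the counts.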