CS 1104 Help Session III
I/O and Buses
Colin Tan, ctank@comp.nus.edu.sg
S15-04-15

Why do we need I/O Devices?
• No computer is an island
  – Computers need to interact with people and with other computers and devices.
  – People don’t speak binary, so we need keyboards and display units that “speak” English (or Chinese etc.) to interface with people.
  – Communicating with other computers presents its own problems:
    • Synchronization
    • Erroneous transmission/reception
    • Error and failure recovery (e.g. the other computer died while talking)

The Need for I/O
• There is also a need for permanent storage:
  – Main memory, cache etc. are volatile devices. Their contents are lost when power is lost.
  – So we have hard disks to store files, data, etc.
• It is impractical to build all of these into the CPU
  – Too complex and too big to put on the CPU.
  – The varying needs of users mean that the types of I/O devices will vary.
    • An office user will need a laser printer, but a ticketing officer needs a dot-matrix printer.

Devices We Will Be Looking At
• Disk Drives
  – Data is organized into fixed-size blocks.
  – Blocks are organized in concentric circles called tracks. The tracks are created on both sides of metallic disks called “platters”.
  – The corresponding tracks on each side of each platter form a cylinder (e.g. track 0 on side 0 of platter 0, track 0 on side 1 of platter 0, track 0 on side 0 of platter 1 etc.)
  – Latencies are involved in finding the data and reading or writing it.
• Network Devices
  – Transmit data between computers.
  – Data is often organized in blocks called “packets”, or sometimes into frames.
  – Latencies are involved in sending/receiving data over a network.
• Buses
  – Buses carry data and control signals from the CPU to/from memory and I/O devices.
• DMA Controller (DMAC)
  – The DMAC performs transfers between I/O devices and memory without CPU intervention.
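Returning to the disk organization described above (blocks on tracks, tracks on both sides of each platter, matching tracks forming a cylinder), the following is a minimal sketch of how the geometry determines capacity. The class name, field names and the sample numbers are all hypothetical and chosen only for illustration; they are not taken from the notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskGeometry:
    """Hypothetical description of a disk; all names are illustrative."""
    platters: int
    tracks_per_side: int
    blocks_per_track: int
    block_bytes: int
    sides_per_platter: int = 2  # tracks are created on both sides of a platter

    def tracks_per_cylinder(self) -> int:
        # A cylinder is the same track number taken on every side of every platter.
        return self.platters * self.sides_per_platter

    def capacity_bytes(self) -> int:
        # sides x tracks per side x blocks per track x bytes per block
        sides = self.platters * self.sides_per_platter
        return sides * self.tracks_per_side * self.blocks_per_track * self.block_bytes


# Made-up sample disk: 4 platters, 1000 tracks per side, 32 blocks of 4096 bytes per track.
disk = DiskGeometry(platters=4, tracks_per_side=1000, blocks_per_track=32, block_bytes=4096)
print(disk.tracks_per_cylinder())  # 8
print(disk.capacity_bytes())       # 1048576000
```

The same four questions (how many sides, how many tracks per side, how many blocks per track, how big is each block) apply to any disk described this way.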
Disk Drives
• Latencies involved in accessing drives:
  – Controller Overhead: To read or write a drive, a request must be made to the drive controller. The drive controller may take some time to respond. This delay is called the “controller overhead”, and is usually ignored.
    • Controller overhead also includes delays introduced by the controller circuitry in transferring data.
  – Head-Selection Time: Each side of each platter has a head. To read or write the disk, first select which side of which platter to access by activating its head and deactivating the other heads. Normally this time is ignored.
  – Seek Time: Once the correct head has been selected, it must be moved over the correct track. The time taken to do this is called the “seek time”, and is usually between 8 and 20 ms (NOT NEGLIGIBLE!)
  – Rotational Latency: Even when the head is over the correct track, it must wait for the block it wants to read to rotate by.
    • The average rotational latency is T/2, where T is the time in seconds for one rotation (T = 60/R if the rotation speed R is specified in RPM, or T = 1/R if R is specified in RPS).
  – Transfer Time: This is the time taken to actually read the data. If the throughput of the drive is given as X MB/s and we want to read Y bytes of data, then the transfer time in seconds is given by: Y/(X * 10^6)

Example
• A program is written to access 3 blocks of data (the blocks are not contiguous and may exist anywhere on the disk) from a disk with a rotation speed of 7200 rpm, 12 ms seek time, throughput of 10 MB/s and a block size of 16 KB. Compute the worst case timing for accessing the 3 blocks.
• Analysis:
  – Each block can be anywhere on the disk
    • In the worst case, we must incur seek, rotational and transfer delays for every block.
  – What is the timing for each delay?
    • Controller overhead - negligible (since not given)
    • Head-selection time - negligible (since not given)
    • Seek time
    • Rotational latency
    • Transfer time
  – How many times is each of these delays incurred?

Example
• A disk drive has a rotational speed of 7200 rpm. Each block is 16 KB, and there are 16 blocks per track. There are 22 platters with 25 tracks each. The average seek time is 12 ms.
  – What is the capacity of this disk?
  – How long does it take to read 1 block of data?
• Analysis
  – Size:
    • How many sides are there? How many tracks per side? How many blocks per track? How big is each block?
  – Time to read 1 block
    • Throughput is not given. How to work it out?

Network Devices
• Some major network types in use today:
  – Ethernet
    • The most common networking technology
    • Poor performance under high traffic.
  – FDDI - uses laser and fibre optic technology to transmit data
    • Fast, expensive.
    • Slowly being replaced by gigabit ethernets.
  – Asynchronous Transfer Mode (ATM)
    • Fast throughput by using simple and fast components
    • Very expensive.
    • Example of daily ATM use: Singtel Magix (ADSL)

Ethernet Packet Format
  Field            Size
  Preamble         8 bytes
  Dest Addr        6 bytes
  Src Addr         6 bytes
  Length of Data   2 bytes
  Data             0-1500 bytes
  Pad              0-46 bytes
  Check            4 bytes
• Preamble to recognize beginning of packet
• Unique address per Ethernet Network Interface Card, so a card can just be plugged in and used
• Pad ensures minimum packet is 64 bytes
  – Easier to find packet on the wire
• Header + Trailer: 24B + Pad

Software Protocol to Send and Receive
• SW Send steps
  1: Application copies data to OS buffer
  2: OS calculates checksum, starts timer
  3: OS sends data to network interface HW and says start
• SW Receive steps
  3: OS copies data from network interface HW to OS buffer
  2: OS calculates checksum; if OK, sends ACK; if not, deletes message (sender resends when timer expires)
  1: If OK, OS copies data to user address space, and signals application to continue

Network Devices
• Latencies Involved:
  – Interconnect Time: This is the time taken for 2 stations to “hand-shake” and establish a communications session.
  – Hardware Latencies: There is some latency in gaining access to a medium (e.g. in Ethernet the Network Interface Card (NIC) must wait for the Ethernet cable to be free of other activity) and in reading/writing to the medium.
  – Software Latencies: Network access often requires multiple buffer copying operations, leading to delays.
  – Propagation Delays: For very large networks stretching thousands of miles, signals do not reach their destination immediately, and take some time to travel in the wire. More details in CS2105.
  – Switching Delays: Large networks often have intermediate switches to receive and re-transmit data (to restore signal integrity, for routing etc.). These switches introduce delays too. More details in CS2105.
  – Data Transfer Time: Time taken to actually transfer the data. If we wish to transfer Y bytes of data over a network link with a throughput of X MBPS, the data transfer time in seconds is given by: (Y bytes)/(X * 10^6)
  – The Data Transfer Time is where real useful work is actually being done. None of the other latencies accomplish anything useful (though they are still necessary), and these are termed “overheads”.
• Note that if the overheads are much larger than the data transfer time, it is possible for a slow network with low overheads to perform better than a fast network with high overheads.
  – E.g. Page 654 of Patterson & Hennessy.

Example
• A communications program was written and profiled, and it was found that it takes 40ns to copy data to and from the network. It was also found that it takes 100ns to establish a connection, and that effective throughput was 5 MBPS. Compute how long it takes to send a 32KB block of data over the network.
• Analysis:
  – What are the overheads? What is the data transfer time?

Buses
• Buses are extremely important devices (essentially they’re groups of wires) that bring data and control signals from one part of a system to another.
• Categories of bus lines:
  – Control Lines: These carry control signals like READ/WRITE signals and the CLK signal.
  – Address Lines: These contain identifiers of devices to read/write from, or addresses of memory locations to access.
  – Data Lines: These actually carry the data we want to transfer.
• Sometimes the data/address lines are multiplexed onto the same set of lines. This allows us to build cheaper but slower buses.
  – Must alternate between sending addresses and sending data, instead of spending all the time sending data.

Types of Buses
• 3 Broad Categories of Buses
  – CPU/Memory Bus: These are very fast (100 MHz or more), very short buses that connect the CPU to the cache system and the cache system to the main memory. If the cache is on-chip, then it connects the CPU to the main memory.
  – I/O Bus: The I/O bus connects I/O devices to the CPU/Memory Bus, and is often very slow (12 MHz to 66 MHz).
  – Backplane Bus: The backplane bus is a mix of the 2, and often the CPU, memory and I/O devices all connect to the same backplane bus.

Combining Bus Types
• Can have several schemes:
  – 1 bus system: CPU, memory, I/O devices all connected to the memory bus.
  – 2 bus system: CPU, memory connected via memory bus, and I/O connected via I/O bus.
  – 3 bus system: CPU and memory connected via memory bus, I/O connected via small set of backplane buses.

1-Bus System
[Figure: CPU, Memory, Tape, Console and Disk all attached to a single Memory Bus]
• 1-bus system: CPU, memory and I/O share a single bus.
• Bad, bad, bad: I/O is very slow and slows down the memory bus.
• Affects performance of memory accesses and hence overall CPU performance.

2-Bus System
[Figure: CPU and Memory on the Memory Bus; a Bus Adapter links the Memory Bus to the I/O Bus, which carries Tape, Console and Disk]
• 2-bus system: CPU and memory communicate via the memory bus.
• I/O devices send data via the I/O bus.
• The I/O bus is de-coupled from the memory bus by an I/O controller.
• The I/O controller will coordinate transfers between the fast memory bus and the slow I/O bus.
  – Buffers data between buses so no data is lost.
  – Arbitrates for memory bus if necessary.
• In the notes, the I/O controller is called a “Bus Adapter”. Both terms mean the same thing.

3-Bus System
[Figure: CPU and Memory on the Memory Bus; a Bus Adapter links the Memory Bus to the Backplane Bus; further Bus Adapters link the Backplane Bus to I/O Buses carrying Console, Disk and Tape]
• Memory and CPU are still connected directly
  – This is important because it allows fast CPU/memory interaction.
• A backplane bus interfaces with the memory bus via a Bus Adapter.
  – Backplane buses typically have very high bandwidth,
    • Not quite as high as the memory bus though.
• Multiple I/O buses interface with the backplane bus.
  – Possible to have devices on different I/O buses communicating with each other, with the CPU completely uninvolved!
    • Very efficient I/O transfers possible.
Synchronous vs Asynchronous
• Synchronous buses: Operations are coordinated based on a common clock.
• Asynchronous buses: Operations are coordinated based on control signals.

Synchronous Example (Optional)
• A typical memory system works in the following way:
  – Addresses are first placed on the address bus.
  – After a delay of 1 cycle (the hold time), the READ signal is asserted.
  – After 4 cycles, the data will become available on the data lines.
  – The data remains available for 2 cycles after the READ signal is de-asserted, during which time no new read operations may be performed.
[Timing diagram: CLK, ADDR, READ, DATA]
• Given that the synchronous bus in the previous example is operating at 200MHz, and that the time taken to read 1 word (4 bytes) of data from the DATA bus is 40ns, compute the maximum memory read bandwidth for this bus (assume that the READ line is dropped only after reading the data). Assume also that the time taken to place the address on the address bus is negligible.
• Analysis:
  – How long is each clock cycle in seconds or ns?
  – How long does it take to set up the read? (put address on address bus, assert the READ signal, wait for the data to appear, read the data, de-assert the READ signal)
  – How long does it take before you can repeat the READ operation?
  – Therefore, in 1 second, how many bytes of data can you read?

Asynchronous Bus Example (Optional)
• Asynchronous buses use a set of request/grant lines to perform data transfers instead of a central clock.
• E.g. Suppose the CPU wants to write to memory
  – 1. CPU will make a request by asserting the MEMW line.
  – 2. Memory sees MEMW line asserted, and knows that CPU wants to write to memory. It asserts a WGNT line to indicate the CPU may proceed with the write.
  – 3. CPU sees the WGNT line asserted, and begins writing.
  – 4. When CPU has finished writing, it de-asserts the MEMW line.
  – 5. Memory sees MEMW line de-asserted, and knows that CPU has completed writing.
  – 6. In response, memory de-asserts the WGNT line. CPU sees WGNT line de-asserted, and knows that memory understands that writing is complete.

Asynchronous vs. Synchronous: A Summary
• Asynchronous Buses
  – Coordination is based on the status of control lines (MEMW, WGNT in our example).
  – Timing is not very critical. Devices can work as fast or as slow as they want without worrying about timing.
  – More difficult to design and build devices for async buses.
    • Need good understanding of protocol.
• Synchronous Buses
  – Coordination is based on a central clock.
  – Timing is CRITICAL. If a device exceeds or goes below the specified number of clock cycles, the system will fail (“clock skewing”).
  – However, synchronous buses are fast, and it is simpler to design devices for them.

Bus Arbitration
• Often more than one device is trying to gain access to a bus:
  – A CPU and a DMA controller may both be trying to use the CPU-Memory bus.
  – Only 1 device can use the bus at a time, so we need a way to arbitrate who gets to use it.
• Buses are a common set of wires shared by many devices.
• If >1 device tries to access the bus at the same time, there will be collisions and the data sent/received along the bus will be corrupted beyond recovery.
  – Solve by prioritizing: If n devices need to use the bus, the one with the highest priority will use it.
• Bus arbitration may be done in a co-operative way (each device knows and co-operates in determining who has higher priority)
  – No single point of failure
  – Complicated
• May also have a central arbiter to make decisions
  – Easier to implement
  – Bottleneck, single point of failure.

Central Arbitration
[Figure: Dev0 to Dev3, each connected to the Bus Controller by its own request line (Req0 to Req3) and grant line (GNT0 to GNT3)]
• Devices wishing to use the bus will send a request to the controller.
• The controller will decide which device can use the bus, and assert its grant line (GNTx) to tell it.
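The central arbitration scheme above can be sketched in a few lines. This is only an illustration: the function names are made up, and the rule that a lower device number means a higher priority is an assumption of the sketch, not something the notes specify.

```python
from typing import List, Optional, Sequence


def central_arbiter(requests: Sequence[bool]) -> Optional[int]:
    """Return the index of the device to grant, or None if nobody is requesting.

    Assumption for this sketch: a lower device number means a higher priority.
    """
    for device, requesting in enumerate(requests):
        if requesting:
            return device
    return None


def grant_lines(requests: Sequence[bool]) -> List[bool]:
    """Model the GNTx lines: at most one of them is asserted at a time."""
    winner = central_arbiter(requests)
    return [device == winner for device in range(len(requests))]


# Dev1 and Dev3 both assert their request lines; only GNT1 is asserted.
print(grant_lines([False, True, False, True]))    # [False, True, False, False]
# Nobody is requesting, so no grant line is asserted.
print(grant_lines([False, False, False, False]))  # [False, False, False, False]
```

Because exactly one grant line is asserted per decision, two devices never drive the bus together, which is the collision problem arbitration is meant to prevent.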
Distributed Arbitration
• Devices can also decide amongst themselves who should use the bus.
[Figure: Dev0 to Dev3 all attached to shared Request lines (Req0 to Req3) and Grant lines (GNT0 to GNT3)]
• Every device knows which other devices are requesting.
• Each device will use an algorithm to collectively agree who will use the bus.
• The device that wins will assert its GNTx line to show that it knows that it has won and will proceed to use the bus.

Arbitration Schemes
• Round Robin (Centralized or Distributed Arbitration)
  – Arbiter keeps a record of which device last had the highest priority to use the bus.
  – If dev0 had the highest priority, on the next request cycle dev1 will have the highest priority, then dev2 all the way to devn, and it begins again with dev0.
• Daisy Chain (Usually centralized arbitration)
[Figure: Bus Controller followed by Dev 0, Dev 1, Dev 2 and Dev 3 in a chain, linked by a single Req line and a single GNT line]
  – Only 1 request and 1 grant line.
  – Request lines are relayed to the bus controller through the intervening devices.
  – If the bus controller sees a request, it will assert the GNT line.
  – The GNT line is again relayed through intervening devices, until it finally reaches the requesting device, and the device can now use the bus.
  – If an intervening device also needs the bus, it can hijack the GNT signal and use the bus, instead of relaying it on to the downstream requesting device.
    • E.g. If both Dev3 and Dev1 request the bus, the controller will assert GNT. Dev1 will hijack the GNT and use the bus instead of passing the GNT on to Dev3.
  – Devices closer to the arbiter have higher priority.
  – Possible to starve lower-priority devices.
• Collision Detection
  – This scheme is used in Ethernet, the main LAN technology that connects computers together.
  – Properly called “Carrier Sense Multiple Access with Collision Detection”, or CSMA/CD.
  – In such schemes, all devices (“stations”) have permanent and continuous access to the bus:
[Figure: several stations (IBM compatibles, a workstation, a laptop computer, a Mac SE/Classic) attached to a shared Ethernet]
• CSMA/CD Algorithm
  – Suppose a station A wishes to transmit:
    • Check the bus, and see if any station is transmitting.
    • If no, transmit. If yes, wait until the bus becomes free.
    • Once free, start transmitting. While transmitting, listen to the bus for collisions.
  – Collisions can be detected by a sudden increase in the average bus voltage level.
  – Collisions occur when at least 2 stations A and B see that the bus is free, and begin transmitting together.
  – In the event of a collision:
    • All stations stop transmitting immediately.
    • All stations wait a random amount of time, test the bus, and restart transmission if it is free.
• Advantages:
  – Completely distributed arbitration, little coordination between stations needed.
  – Very good performance under light traffic (few stations transmitting).
• Disadvantages
  – Performance degrades exponentially relative to the number of stations transmitting
    • If many stations wish to transmit together, there will be many collisions and stations will need to resend data repeatedly.
    • In the worst case, effective throughput can fall to 0.
• Fixed Priority (Centralized or Distributed Arbitration)
  – Some devices have higher priority than others.
  – This priority is fixed.

A Bus Analysis Example
• Page 665 - The example that no one understands
Suppose we have a system with the following characteristics:
  – A memory and bus system supporting block access of 4 and 16 32-bit words.
  – A 64-bit synchronous bus clocked at 200MHz, where each 64-bit transfer takes 1 cycle, and 1 cycle is required to send an address to memory.
  – Two clock cycles between each bus operation (assume the bus is idle before an access).
  – A memory access time for the first four words of 200 ns; each additional set of four words can be read in 20ns. Assume that a bus transfer of the most recently read data and a read of the next four words can be overlapped.
• Find the sustained bandwidth for a read of 256 words for transfers that use 4-word blocks and 16-word blocks.
• Analysis: A 200MHz clock means that each cycle is 5ns.
1. For 4-word block transfers:
  – Need 1 cycle to send the address to memory.
  – Need 200ns/5ns = 40 cycles to read the first 4 words.
  – The bus is 64 bits wide, which is 2 words. To send 4 words at 2 words per cycle, need 2 cycles to send the data.
  – Need a further 2 cycles of idle time between transfers.
  – This makes a total of 1 + 40 + 2 + 2 = 45 cycles.
  – To transfer 256 words, need a total of 256/4 = 64 transactions.
  – Total number of cycles = 64 * 45 = 2,880 cycles.
  – Each cycle is 5ns, so 2,880 cycles is 14,400ns.
  – Since 64 transactions take 14,400ns, the number of transactions per second = 64 * 1s/(14,400ns) = 4.44 M transactions/s.
  – To find bandwidth:
    • It takes 14,400ns to transfer 256 words. So in one second, we can transfer (256 * 4) * 1s/(14,400ns) bytes = 71.11 MB/s.
2. For 16-word blocks:
  – It takes 1 cycle to send the address
  – 40 cycles to read the first 4 words
  – 2 cycles to send the 4 words
  – 2 cycles of idle time.
  – Since we can overlap the read of the next 4 words (taking 20ns = 4 cycles) with this 2-cycle transfer time and 2-cycle idle time, the data for the next 4 words will be ready once these 4 cycles are up.
  – Hence it is now possible to send the subsequent four words in 2 cycles, and idle for two cycles.
  – Again, during this send-and-idle time of 4 cycles, the 3rd group of 4 words becomes available, and can be sent in 2 cycles, followed by an idle of 2 cycles.
  – During this time the 4th group of 4 words becomes available and is sent over 2 cycles, followed by an idle of 2 cycles.
  – Hence the total number of cycles required is: 1 + 40 + (2 + 2) + (2 + 2) + (2 + 2) + (2 + 2) = 57 cycles
  – It takes 57 cycles per transaction, and the total number of transactions = 256/16 = 16 transactions.
  – Total number of cycles = 16 * 57 = 912 cycles.
  – This is equal to 912 * 5ns = 4,560ns.
  – Hence the number of transactions per second is equal to: 16 * (1s/4,560ns) = 3.51 M transactions/s
  – A total of (256 * 4) bytes can be sent in 4,560ns, so in one second we can have a throughput of: (256 * 4) * (1s/4,560ns) = 224.56 MB/s

Polling vs. Interrupts
• After a CPU has requested an I/O operation, it can do one of two things to see if the operation has been completed:
  – Keep checking the device to see if the operation is complete - Polling
  – Go do other stuff, and when the device completes the operation, it will tell the CPU - Interrupts.

Polling
• In a polling scheme, the devices have special registers called “Status Registers”.
• Status registers contain flags that indicate if an operation has completed, error conditions etc.
• The CPU will periodically check these registers until either a flag indicates that the operation has completed, or another flag indicates an error condition.

Why is Polling Good?
• Polling is simple to implement
  – Just need flip-flops to indicate flag status.
  – Most of the work is done in software
    • Cheap! Simple to design hardware!

Why is Polling Bad?
• Polling basically works like this:
  – You are expecting a phone call.
  – Your phone does not have a ringer.
  – You spend the entire day randomly picking up the phone to see if the other person is on the other end of the line.
  – If he isn’t, you put the phone back down and try again later.
  – If he is, you start talking.
• You can waste a lot of time doing this!

Polling Example
• Suppose we have a hard disk with a throughput of 4 MB/s. The disk transfers in 4-word chunks. The drive actually transfers data only 5% of the time.
• How many times per second does the CPU need to poll the disk so that no data is lost? If the CPU speed is 500MHz, and each poll requires 400 cycles, what portion of CPU time is spent polling?
• Analysis:
  – The CPU has to poll regardless of whether the drive is actually transferring data or not!!
  – Data is transferred at 4MB/second, in 4-word (i.e. 16-byte) chunks.
  – Therefore the number of polls required per second is 4MB/16, which is equal to 250,000 polls.
  – Each poll takes 400 cycles, so the total number of cycles for polling is 250,000 x 400 = 100,000,000 cycles per second!
  – Proportion of CPU time spent = (100x10^6)/(500x10^6), which is equal to 20%.
  – So 20% of the CPU time is spent just polling. Inefficient.

Interrupts
• Alternative:
  – CPU makes a request, then does other things.
  – When the I/O device is done, it will inform the CPU via an interrupt.
• This is like having a telephone with a ringer
  – You pick the phone up and talk only when it rings.
  – More efficient!

Interrupt Example
• Suppose we have the same disk arrangement as before, but this time interrupts are used instead of polling. Find the fraction of processor time taken to process interrupts, given that it takes 500 cycles to process an interrupt and that the disk sends data 5% of the time.
• Analysis:
  – Each time the disk wants to transfer a 4-word (i.e. 16-byte) block, it will interrupt the processor. While the disk is transferring, the number of interrupts per second would be 4MB/16 = 250,000 interrupts per second.
  – Number of cycles per interrupt = 500
  – Therefore the number of cycles per second to service interrupts is 500 x 250,000 = 125,000,000 or 125x10^6.
  – Percentage of CPU time spent processing interrupts per second is now (125x10^6)/(500x10^6) = 25%
  – Worse than before!!
  – BUT interrupts occur only when the drive actually has data to transfer!
    • This happens only 5% of the time!
  – Therefore the actual percentage of CPU time used per second is 5% of 25% = 1.25%.

Polling vs. Interrupts: The Conclusion
• Numerical examples show that polling is very expensive.
  – The CPU has no way of knowing whether a device needs attention, and so has to keep polling repeatedly so as not to miss data.
  – In our example, even if the drive is idle 95% of the time, the CPU still has to poll.
• Interrupts allow the CPU to do useful work during this 95% of the time.
  – So even if processing an interrupt takes longer (500 cycles vs. 400 cycles for polling), in the end a smaller portion of CPU time is used to process the device (1.25% vs. 20%).

Interrupt Implementation
• How are interrupts implemented?
  – Most CPUs have multiple interrupt lines (e.g. 1 interrupt for the disk drive, 1 for the keyboard etc.).
  – Each interrupt will have its own interrupt handler
    • Basically we don’t want a keyboard driver to be processing disk interrupts!
  – This means that when we have an interrupt, we need a way of knowing which handler we should call.
    • We really don’t want to call the drive handler when we have network messages.
  – The currently executing process is suspended as the handler is called.
    • The handler must save registers etc. so that the interrupted process can resume properly later on.
• Two solutions:
  – Interrupt Vector Tables
    • Interrupts are assigned numbers (e.g. the drive interrupt may be interrupt 1, network interrupts may be interrupt 2 etc.).
    • When an interrupt occurs, the CPU will check which interrupt it was, and get that interrupt’s number.
    • It will use the number to consult a look-up table. The table will tell the processor which handler to use.
    • The processor hands control over to that handler.
    • The look-up table used is called an Interrupt Vector Table. Table look-ups and handing-over of control (vectoring) are handled completely in hardware.
    • This scheme is used in Intel processors.
• Second option:
  – Again we have multiple numbered interrupts just like before.
  – When an interrupt occurs, the interrupt number is placed into a “Cause Register”.
  – A centralized handler is called. It will read the Cause Register, and based on that it will call the appropriate routine.
  – Conceptually similar to the first option, except that the vectoring is done in software instead of hardware.
  – This technique is used in the MIPS R2000.
• Interrupts are prioritized
  – If 2 interrupts occur together, the higher priority one gets processed first.
• Re-entrancy
  – Sometimes as an interrupt is being processed, the interrupt may occur again.
  – This will cause the interrupt handler to itself be interrupted.
  – The handler must be careful to save registers etc. in case something like this happens, so that it can restart correctly.

Direct Memory Access
• So far all the devices we’ve seen rely on the CPU to perform the transfers, operations, etc.
  – E.g. when the drive interrupts the CPU, the CPU has to go to the drive’s data registers, read the 4 words that the drive has just retrieved, and store them into the buffers in memory.
  – Consumes valuable CPU cycles.
• A better idea would be to have a dedicated device (the Direct Memory Access Controller or DMAC) to do these transfers for us.
  – CPU sends a request to the DMAC.
    • The request includes information like where to get the data from (side#, track#, block# for disk, for example), how many words of data to transfer, and when to begin the transfer.
    • The DMAC starts the transfer, copying the data directly between the device and memory (hence the name Direct Memory Access).
  – DMAC needs to arbitrate for the memory bus.
  – When the DMAC is done, it will notify the CPU via interrupts (can also be via CPU polling).

Summary
• I/O is important so that we can interact with computers, and computers can interact with each other to transfer data etc.
• I/O devices include things like keyboards, monitors, disks, network cards (NIC), etc.
• I/O may be supported by polling or interrupts
  – Polling wastes CPU time, but the hardware is simple to design.
  – Interrupts are more complex and interrupt handling may take longer, but overall CPU cycles are saved.
• There are 3 main types of buses
  – Memory Bus
  – I/O Bus
  – Backplane Bus
• Most systems are made of combinations of 2 or 3 of these buses.
• Bus arbitration is required if >1 device can read/write the bus.
• Arbitration can be centralized or distributed.
• Several schemes can be used to implement arbitration
  – Round-robin
  – Daisy-Chain
  – Fixed priority.
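The figures in the bus analysis example and in the polling and interrupt examples can be re-derived with a short script. This is a sketch under the stated problem parameters only; the function and constant names are made up for illustration. The bus formula also assumes, as in the example, that each later 4-word memory read (4 cycles) is fully hidden behind the 2-cycle transfer plus 2-cycle idle time.

```python
CYCLE_NS = 5              # 200 MHz bus clock, so 5 ns per cycle
ADDRESS_CYCLES = 1        # send the address to memory
FIRST_ACCESS_CYCLES = 40  # 200 ns for the first four words
TRANSFER_CYCLES = 2       # four 32-bit words over a 64-bit bus
IDLE_CYCLES = 2           # idle time between bus operations
BYTES_PER_WORD = 4


def bus_read_cycles(block_words: int) -> int:
    """Cycles for one block read; later 4-word reads overlap transfer + idle."""
    groups = block_words // 4
    return ADDRESS_CYCLES + FIRST_ACCESS_CYCLES + groups * (TRANSFER_CYCLES + IDLE_CYCLES)


def sustained_bandwidth(total_words: int, block_words: int) -> float:
    """Sustained read bandwidth in bytes per second."""
    transactions = total_words // block_words
    total_ns = transactions * bus_read_cycles(block_words) * CYCLE_NS
    return total_words * BYTES_PER_WORD / (total_ns * 1e-9)


def polling_fraction(bytes_per_s: float, chunk_bytes: int,
                     cycles_per_poll: int, cpu_hz: float) -> float:
    """Fraction of CPU time spent polling; the CPU polls whether or not data flows."""
    polls_per_s = bytes_per_s / chunk_bytes
    return polls_per_s * cycles_per_poll / cpu_hz


def interrupt_fraction(bytes_per_s: float, chunk_bytes: int, cycles_per_interrupt: int,
                       cpu_hz: float, active_fraction: float) -> float:
    """Fraction of CPU time spent on interrupts; they occur only while data flows."""
    interrupts_per_s = bytes_per_s / chunk_bytes
    return active_fraction * interrupts_per_s * cycles_per_interrupt / cpu_hz


print(bus_read_cycles(4), bus_read_cycles(16))         # 45 57
print(f"{sustained_bandwidth(256, 4) / 1e6:.2f} MB/s")   # 71.11 MB/s
print(f"{sustained_bandwidth(256, 16) / 1e6:.2f} MB/s")  # 224.56 MB/s
print(f"{polling_fraction(4e6, 16, 400, 500e6):.2%}")    # 20.00%
print(f"{interrupt_fraction(4e6, 16, 500, 500e6, 0.05):.2%}")  # 1.25%
```

Changing the block size or the drive's active fraction shows how sensitive both results are to the assumptions in the problem statements.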