Make Protocol Ready for Gigabit Scopes
• In this presentation, we present various protocol design and implementation techniques that allow a protocol to function correctly on Gbps networks, or to deliver Gbps performance to the user application or the system output.
• (The previous presentation covered operating system design and implementation techniques for supporting Gbps networks.)

Protect Against Wrapped Sequence Number

Sequence Number Wrapping Around
• TCP uses sequence numbers to detect packet loss and duplicate packets.
• TCP's sequence number counts bytes rather than packets. The sequence number field is 32 bits long.
• On a gigabit network, it takes only about 34 seconds (2^32 bytes × 8 bits / 10^9 bps) to wrap around the sequence number space! The TTL field in the IP header only limits the maximum number of hops a packet can traverse. It does not limit the maximum amount of time a packet can stay in the network.
• A wrapped sequence number can make the comparison of the freshness of two sequence numbers come out wrong, with serious consequences.

Problem 1: Sequence Number Drops to Zero
• Suppose that the sequence number field is n bits long. When the sequence number grows from 0, 1, … to 2^n - 1, the next increment wraps it back to zero.
• As a result, when we compare two sequence numbers a and b, where b was issued after a (so b should compare as larger), a plain numeric comparison may give the wrong result.
• The effect of the wrong comparison is that a more recent packet carrying b is rejected and discarded because it is considered older than the packet (carrying a) that has already been received.

Sequence Number Wheel to Avoid Problem 1
• To avoid the comparison problem, we can use a sequence number wheel scheme.
• The sequence number space (N = 2^n) is divided into two parts, each of size N/2.
• The dividing line is not fixed. It floats with the sequence number (e.g., a) being compared against.
• One part represents all the sequence numbers that are considered larger than a.
– b is considered larger than a if |a – b| < N/2 and a < b, or |a – b| > N/2 and a > b (here < and > are plain numeric comparisons).
• The other part represents all the sequence numbers that are considered smaller than a.
– This covers all other cases.

Sequence Number Wheel

Problem 2: Sequence Number Wraps and Grows Up
• On a Gbps network, in about 34 seconds, a sequence number can wrap and grow back to nearly the same value (e.g., a -> 0 -> a – 1).
• This means that an outdated packet (carrying a) that stays in the network for that long may look exactly like the next packet that the receiver expects, because the last packet received carries a – 1.
• This problem may result in a corrupted received file.
• This problem cannot be solved by the sequence number wheel scheme.

PAWS Used to Detect Problem 2
• PAWS (Protect Against Wrapped Sequence numbers, RFC 1323) is a scheme used in tcp_input() to detect problem 2.
• PAWS is based on the premise that, on a high-speed network, the 32-bit timestamp value wraps around at a much lower frequency than the 32-bit sequence number.
– The TCP timestamp option is assumed to be present in the TCP header.
– Currently, in FreeBSD 4.x, one timestamp tick represents 1 ms. Therefore, it takes about 24.8 days (2^31 ms, roughly 596 hours) for the sign bit to change, and about 49.7 days (2^32 ms, roughly 1193 hours) for the timestamp to wrap completely.

PAWS Used to Detect Problem 2 (continued)
• Therefore, when tcp_input() receives a packet, it first checks whether the new packet's timestamp is older (smaller) than the timestamp of the most recently received packet.
• If the timestamp is older, the packet is normally dropped.
• The exception is a TCP connection that has been idle for more than 24 days: its timestamp may have wrapped around, which can make the comparison wrong.
– In this case, the packet is not dropped.

TCP Checksum Offloading

Turn off Checksum Computation
• Computing the checksum is very expensive.
– Every byte of a packet needs to be read from memory into the CPU, added to the running sum, and then written from the CPU back to memory.
– CPU cycles are wasted.
– The bandwidth of the CPU-memory bus or of the memory system (whichever is smaller) is wasted.
– Therefore, on Gbps networks, avoiding or reducing the checksum cost becomes an important topic.
• Solution 1: Do not calculate the checksum at all.
– E.g., on FreeBSD 4.x, you can currently turn off checksum computation for UDP packets.

Checksum Offloading
• Solution 2: Let the network interface card compute the checksums (the IP header checksum and the TCP checksum over the data payload).
• Nowadays almost every Gigabit Ethernet NIC supports computing checksums on the NIC. For example, see the 3Com if_ti.c driver.
• To take advantage of this hardware function, the NIC device driver needs to communicate with the TCP code so that tcp_output() and tcp_input() know whether they should compute checksums themselves.
• The risk of checksum offloading is that errors introduced between the TCP layer and the device driver cannot be detected by the receiver!

Free Checksum Computation on some RISC Processors
• Sometimes, on a computer system with a RISC processor, the checksum computation can be performed at no extra cost.
• Some researchers observed that most RISC processors can issue two instructions per clock cycle, of which only one can be a load from memory or a store to memory.
• Thus, there is room for two additional non-memory instructions in the following copy loop:
– Load [%r0] %r2
– Store %r2 [%r1]

Free Checksum Computation on some RISC Processors (continued)
• As a result, if we add two instructions that calculate the checksum between the load and the store, the checksum is calculated for free. (No other work can be done between the load and store instructions anyway.)
– Load [%r0] %r2
– Add %r5 %r2 %r5 ! Add to running sum in r5
– Addc %r5 #0 %r5 ! Add carry into r5
– Store %r2 [%r1]
• This example also shows that programmed I/O is not always worse than DMA.
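The load/add/addc/store loop above can be sketched in Python to show the single-pass structure. This is only an illustration of the technique, not the FreeBSD code: the function name `copy_and_checksum` is hypothetical, and the arithmetic assumes the standard 16-bit ones'-complement Internet checksum over big-endian 16-bit words.

```python
def copy_and_checksum(src):
    """Copy src into a new buffer and compute the 16-bit Internet checksum
    in the same pass (hypothetical helper, for illustration only)."""
    dst = bytearray(len(src))
    total = 0  # running ones'-complement sum, kept within 16 bits
    n = len(src)
    i = 0
    while i + 1 < n:
        hi, lo = src[i], src[i + 1]                # "Load": read one 16-bit word
        total += (hi << 8) | lo                    # "Add": add to running sum
        total = (total & 0xFFFF) + (total >> 16)   # "Addc": fold the carry back in
        dst[i], dst[i + 1] = hi, lo                # "Store": write the word out
        i += 2
    if i < n:
        # Odd trailing byte: treated as the high byte of a zero-padded word.
        total += src[i] << 8
        total = (total & 0xFFFF) + (total >> 16)
        dst[i] = src[i]
    return dst, (~total) & 0xFFFF


if __name__ == "__main__":
    data = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
    copied, csum = copy_and_checksum(data)
    print(copied == data, hex(csum))
```

In Python the extra arithmetic is of course not free. The point is only that the data is touched once, so the checksum adds no additional memory traffic to the copy.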
TCP Header Prediction for Gigabit Networks

TCP Implementation is Complicated
• In FreeBSD 4.2, tcp_input.c has 2797 lines of C code and tcp_output.c has 939 lines. There are 339 lines of "if" statements in tcp_input.c and 126 lines in tcp_output.c.
• These numbers show:
– TCP processing is complicated.
– TCP input processing is more complicated than TCP output processing.
• Previously, we showed that locality is important for good cache performance.
• In addition, because conditional branches can significantly hurt the performance of a pipelined CPU, their effect should be minimized.

TCP Header Prediction
• Header prediction looks for packets that fit the profile of the packets that the receiver expects to receive next.
• If a packet meets the header prediction condition, it is handled in just a few instructions. Otherwise, it is handled by the general processing code.
• Actually, this is an old design principle: optimize for the common case!
• The TCP header prediction scheme can improve TCP transfer throughput because it improves instruction locality, which in turn improves cache performance.

Two Common Cases
• If TCP is sending data, the next expected segment for this connection is an ACK for outstanding data.
• If TCP is receiving data, the next expected segment for this connection is the next in-sequence data segment.
• On a LAN, where packet losses are rare, header prediction succeeds for 97% to 100% of packets. On a WAN, where packet losses are more likely, the percentage drops to between 83% and 99%.
• The code for processing these two common cases is placed at the beginning of tcp_input(). This results in better cache performance.
– However, there is no information about how much this scheme improves TCP transfer throughput.

Measuring RTTs on Gigabit Networks

Measuring RTTs Is Important
• To retransmit lost packets more efficiently, the RTT of a connection should be measured correctly and precisely.
– When a packet is lost, we do not want to wait unnecessarily long before resending it.
• To increase the accuracy of measurements, the first step is to use a high-resolution clock.
– Before FreeBSD 3.0, the clock resolution for TCP RTT measurement was 500 ms.
– Now it is 1 ms.
• The second step is to use more RTT measurement samples to calculate the average RTT.

Timer Management
• Timers are used extensively to measure RTTs.
– When a packet is sent, a timer is started. When the corresponding ACK returns, the timer is stopped. The elapsed time of the timer represents one RTT sample.
• The above approach can be used on a low-speed network such as 10 Mbps. On a gigabit network, where a 1500-byte packet is sent every 12 microseconds, this approach is infeasible.
• To reduce the high frequency of timer setup/cancel operations, the original solution is to take one RTT sample per RTT, rather than one RTT sample for every sent packet.
– Simple to implement. Just do not set up another timer until the previous timer is cancelled.
– However, the accuracy suffers.

TCP Timestamp Option
• To get an RTT sample for every sent packet while avoiding the need to set up and cancel timers at a high frequency, TCP uses a timestamp option (RFC 1323).
• The sender places a timestamp in every sent segment. The receiver echoes the timestamp back in the ACK. This allows the sender to subtract the echoed timestamp from the current time and use the difference as the RTT sample.
– This option must be supported by both the TCP sender and the TCP receiver.
– By contrast, the original timer-based design does not need any support from the receiver.

TCP Window Scale Option for Gigabit Networks

TCP Maximum Throughput
• A TCP connection's maximum achievable throughput is limited by the smaller of the TCP sender's socket send buffer and the TCP receiver's socket receive buffer, divided by the RTT.
– Min(socket send buffer on the sender, socket receive buffer on the receiver) / RTT.
• Although we can use setsockopt() to enlarge the socket send and receive buffers, the advertised window field in the TCP header is only 16 bits long, which limits the window size to 64 KB.
– On gigabit networks, this is clearly not enough.

TCP Window Scale Option
• In this scheme, the definition of the TCP window is enlarged from 16 to 32 bits.
• The window field in the header still uses 16 bits, but an option is defined that applies a scaling operation to the 16-bit value.
– During the 3-way handshake, this option is carried in the SYN and SYN+ACK packets to indicate whether the option is supported.
• In the TCP implementation, the real window size is maintained internally as a 32-bit value.
• The shift field in the option is 1 byte long.
– 0 means no scaling is performed.
– 14 is the maximum, allowing a window of 64 KB * 2^14 (about 1 GB).

TCP Window Scale Option (continued)
• This option can only appear in a packet with the SYN flag set (SYN or SYN+ACK). Therefore, the scale factor is fixed in each direction when the connection is established.
• The shift count is calculated automatically by TCP, based on the size of the socket receive buffer.
– Keep dividing the buffer size by 2 until the resulting number is less than 64 KB.
• Each host thus maintains two shift counts: S for sending and R for receiving.
– Every 16-bit advertised window that is received from the other end is left-shifted by R bits to obtain the real advertised window.
– When we need to send a window advertisement to the other end, the real window size is right-shifted by S bits, and the resulting 16-bit value is placed in the TCP header.

Congestion Control on Gigabit Networks

Congestion Control
• Congestion control is more difficult on gigabit networks.
– Although the absolute control delay (a connection's RTT) may remain about the same regardless of the link bandwidth (10, 100, or 1000 Mbps), the cost of the control delay becomes higher on higher-bandwidth networks.
– Why?
During the same RTT, a larger amount of data is injected into the network before the control packet reaches the traffic source and reduces its sending rate.
– For example, in the Gigabit Ethernet 802.3x PAUSE flow control scheme, one RTT is required for the pause packet to take effect. Therefore, as the bandwidth increases, more data will be sent before congestion control starts to take effect.
– The result is that congestion control becomes less and less effective on high-bandwidth networks. (No solution!)
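To make the cost of the control delay concrete, the amount of data a sender injects during one RTT can be worked out for each link speed. The sketch below is only illustrative: the helper `bytes_in_flight` is hypothetical, and the 10 ms RTT is an assumed value, not a figure from the slides.

```python
def bytes_in_flight(bandwidth_bps, rtt_ms):
    """Bytes sent at full link rate during one RTT (hypothetical helper).

    bandwidth_bps is in bits per second, rtt_ms in milliseconds;
    bits * ms / 8000 gives bytes.
    """
    return bandwidth_bps * rtt_ms // 8000


ASSUMED_RTT_MS = 10  # illustrative RTT, not taken from the slides

for mbps in (10, 100, 1000):
    sent = bytes_in_flight(mbps * 1_000_000, ASSUMED_RTT_MS)
    print(f"{mbps:>4} Mbps: {sent} bytes sent before the feedback takes effect")
```

Under these assumptions the data already in the network when the control packet arrives grows from 12,500 bytes at 10 Mbps to 1,250,000 bytes at 1000 Mbps, even though the RTT itself is unchanged.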