What’s the Internet: “nuts and bolts” view

- millions of connected computing devices: hosts = end systems (examples of hosts? workstation, server, mobile device) running network apps (examples of applications?)
- [figure: hosts attached to a local ISP, a regional ISP, and a company network, interconnected by routers]
- communication links: fiber, copper, radio, satellite; transmission rate = bandwidth (typical bandwidth for modem? wireless?)
- routers: forward packets (chunks of data) — what’s in a packet?
- protocols control sending, receiving of msgs, e.g., TCP, IP, HTTP, FTP, PPP
- Internet: “network of networks”; loosely hierarchical; public Internet versus private intranet
- Internet standards: RFC: Request For Comments; IETF: Internet Engineering Task Force

What’s the Internet: a service view

- communication infrastructure enables distributed applications: Web, email, other examples?
- communication services provided to apps:
  - connection-oriented, reliable (example apps?)
  - connectionless, unreliable (example apps?)

What’s a protocol?

- human protocols: “what’s the time?”, “I have a question”, introductions
- network protocols: machines rather than humans; all communication activity in the Internet governed by protocols
- protocols define the format and order of msgs sent and received among network entities, and the actions taken on msg transmission and receipt

What’s a protocol? (cont.)

A human protocol and a computer network protocol, side by side:
- human: “Hi” → “Hi” → “Got the time?” → “2:00”
- network: TCP connection req → TCP connection response → GET http://www.awl.com/kurose-ross → <file>
- Q: why are protocols so important?

A closer look at network structure

- network edge: applications and hosts
- network core: routers; network of networks
- access networks, physical media: communication links

Protocol “layers”

- Networks are complex! many “pieces”: hosts, routers, links of various media, applications, protocols, hardware, software
- Question: is there any hope of organizing the structure of the network? Or at least our discussion of networks?

Organization of air travel

- a series of steps: ticket (purchase) / baggage (check) / gates (load) / runway takeoff / airplane routing ... airplane routing / runway landing / gates (unload) / baggage (claim) / ticket (complain)

Layering of airline functionality

- ticketing, baggage, gate, takeoff/landing, and airplane-routing functions form layers: the departure airport and arrival airport implement all layers; intermediate air-traffic control centers implement only airplane routing
- Layers: each layer implements a service via its own internal-layer actions, relying on services provided by the layer below

Why layering?

Dealing with complex systems:
- explicit structure allows identification of, and relationships among, a complex system’s pieces; layered reference model for discussion
- modularization eases maintenance and updating of the system; a change in the implementation of a layer’s service is transparent to the rest of the system (e.g., a change in gate procedure doesn’t affect the rest of the system)
- layering considered harmful?
Internet protocol stack

- application: supporting network applications (FTP, SMTP, HTTP)
- transport: host-host data transfer (TCP, UDP)
- network: routing of datagrams from source to destination (IP, routing protocols)
- link: data transfer between neighboring network elements (PPP, Ethernet)
- physical: bits “on the wire”

Encapsulation

[figure: at the source, the application message M is passed down the stack; the transport layer adds header Ht (segment: Ht M), the network layer adds Hn (datagram: Hn Ht M), the link layer adds Hl (frame: Hl Hn Ht M). A switch processes the frame only at the link/physical layers; a router processes up to the network layer; the destination strips the headers layer by layer and delivers M to the application.]

Internet transport protocols: services

TCP service:
- connection-oriented: setup required between client and server processes
- reliable transport between sending and receiving process
- flow control: sender won’t overwhelm receiver
- congestion control: throttle sender when network overloaded
- does not provide: timing, minimum bandwidth guarantees

UDP service:
- unreliable data transfer between sending and receiving process
- does not provide: connection setup, reliability, flow control, congestion control, timing, or bandwidth guarantee

Q: why bother? Why is there a UDP?

Transport vs. network layer

- network layer: logical communication between hosts
- transport layer: logical communication between processes; relies on, and enhances, network-layer services

Household analogy: 12 kids sending letters to 12 kids
- processes = kids
- app messages = letters in envelopes
- hosts = houses
- transport protocol = Ann and Bill
- network-layer protocol = postal service
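To make the encapsulation picture concrete, here is a minimal sketch in Python of the idea of each layer prepending its own header to the unit it gets from the layer above. The field names (src_port, dst_addr, MAC strings, etc.) are illustrative assumptions, not real protocol layouts.

```python
# Minimal encapsulation sketch: each layer wraps the unit from the layer
# above with its own header. Field names are illustrative, not real formats.

def transport_encapsulate(message: bytes, src_port: int, dst_port: int) -> bytes:
    ht = f"Ht[src={src_port},dst={dst_port}]".encode()   # transport header -> segment
    return ht + message

def network_encapsulate(segment: bytes, src_addr: str, dst_addr: str) -> bytes:
    hn = f"Hn[src={src_addr},dst={dst_addr}]".encode()   # network header -> datagram
    return hn + segment

def link_encapsulate(datagram: bytes, src_mac: str, dst_mac: str) -> bytes:
    hl = f"Hl[src={src_mac},dst={dst_mac}]".encode()     # link header -> frame
    return hl + datagram

message = b"M"                                            # application message
segment = transport_encapsulate(message, 5000, 80)
datagram = network_encapsulate(segment, "1.2.3.4", "5.6.7.8")
frame = link_encapsulate(datagram, "aa:aa", "bb:bb")
print(frame)   # Hl[...]Hn[...]Ht[...]M -- the same Hl Hn Ht M nesting as in the figure
```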
Reliable data transfer: getting started

We’ll:
- incrementally develop the sender and receiver sides of a reliable data transfer protocol (rdt)
- consider only unidirectional data transfer (but control info will flow in both directions!)
- use finite state machines (FSMs) to specify sender and receiver: when in a given state, the next state is uniquely determined by the next event; each transition is labeled with the event causing it and the actions taken on the transition

rdt1.0: reliable transfer over a reliable channel

- underlying channel perfectly reliable: no bit errors, no loss of packets
- separate FSMs for sender, receiver: sender sends data into underlying channel; receiver reads data from underlying channel
- sender FSM: one state, “Wait for call from above”; on rdt_send(data): packet = make_pkt(data); udt_send(packet)
- receiver FSM: one state, “Wait for call from below”; on rdt_rcv(packet): extract(packet, data); deliver_data(data)

rdt2.0: channel with bit errors

- underlying channel may flip bits in a packet; checksum used to detect bit errors
- the question: how to recover from errors?
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
- new mechanisms in rdt2.0 (beyond rdt1.0): error detection; receiver feedback: control msgs (ACK, NAK) rcvr → sender

rdt2.0: FSM specification

- sender: two states. In “Wait for call from above”, rdt_send(data) triggers sndpkt = make_pkt(data, checksum); udt_send(sndpkt) and a move to “Wait for ACK or NAK”. There, rdt_rcv(rcvpkt) && isNAK(rcvpkt) triggers udt_send(sndpkt) (retransmit, stay); rdt_rcv(rcvpkt) && isACK(rcvpkt) returns to “Wait for call from above”.
- receiver: one state, “Wait for call from below”. rdt_rcv(rcvpkt) && corrupt(rcvpkt) triggers udt_send(NAK); rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) triggers extract(rcvpkt, data); deliver_data(data); udt_send(ACK).

rdt2.0: operation with no errors / error scenario

[two walkthrough slides tracing the same FSMs: in the error-free case the packet is ACKed and the sender returns to waiting for data from above; in the error case the receiver detects corruption, replies NAK, and the sender retransmits]
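The rdt2.0 sender logic above can be sketched in a few lines of Python. This is only a schematic of the send/wait-for-ACK-or-NAK behavior, not the textbook’s code; udt_send and udt_rcv stand in for the unreliable channel and are simulated here.

```python
# Sketch of the rdt2.0 sender: send a checksummed packet, then block until
# feedback arrives; retransmit on NAK, move on when ACKed. udt_send/udt_rcv
# stand in for the underlying channel (simulated below), as in the FSM above.

def checksum(data: bytes) -> int:
    return sum(data) & 0xFFFF                 # toy checksum, not the Internet checksum

def make_pkt(data: bytes) -> dict:
    return {"data": data, "checksum": checksum(data)}

def rdt20_send(data: bytes, udt_send, udt_rcv):
    sndpkt = make_pkt(data)
    udt_send(sndpkt)
    while True:                               # "Wait for ACK or NAK" state
        feedback = udt_rcv()
        if feedback == "ACK":
            return                            # back to "Wait for call from above"
        udt_send(sndpkt)                      # NAK: retransmit and keep waiting

# Tiny simulated channel: the first copy is NAKed, the retransmission is ACKed.
feedback_sequence = iter(["NAK", "ACK"])
rdt20_send(b"hello",
           udt_send=lambda p: print("udt_send:", p),
           udt_rcv=lambda: next(feedback_sequence))
```

Note that this sender still trusts the ACK/NAK itself, which is exactly the fatal flaw discussed next.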
rdt2.0 has a fatal flaw!

Stop and wait: the sender sends one packet, then waits for the receiver’s response.

What happens if the ACK/NAK is corrupted?
- sender doesn’t know what happened at the receiver!
- can’t just retransmit: possible duplicate

Handling duplicates:
- sender adds a sequence number to each pkt
- sender retransmits current pkt if ACK/NAK garbled
- receiver discards (doesn’t deliver up) duplicate pkt

rdt2.1: sender, handles garbled ACK/NAKs

[sender FSM with four states: “Wait for call 0 from above”, “Wait for ACK or NAK 0”, “Wait for call 1 from above”, “Wait for ACK or NAK 1”. rdt_send(data) builds sndpkt = make_pkt(0, data, checksum) (resp. seq 1) and calls udt_send(sndpkt); while waiting for ACK/NAK, a corrupt reply or a NAK triggers udt_send(sndpkt) (retransmit); a not-corrupt ACK moves the sender on to the other sequence number]

rdt2.1: receiver, handles garbled ACK/NAKs

[receiver FSM with two states: “Wait for 0 from below” and “Wait for 1 from below”. A not-corrupt packet with the expected seq # is extracted, delivered, and ACKed, and the state flips. A corrupt packet triggers a NAK. A not-corrupt packet with the wrong (duplicate) seq # is re-ACKed but not delivered]

rdt2.1: discussion

Sender:
- seq # added to pkt; two seq. #’s (0, 1) will suffice. Why?
- must check if received ACK/NAK corrupted
- twice as many states: state must “remember” whether the “current” pkt has seq # 0 or 1

Receiver:
- must check if received packet is a duplicate; state indicates whether 0 or 1 is the expected pkt seq #
- note: receiver can not know if its last ACK/NAK was received OK at the sender

rdt2.2: a NAK-free protocol

- same functionality as rdt2.1, using ACKs only
- instead of NAK, receiver sends ACK for the last pkt received OK; receiver must explicitly include the seq # of the pkt being ACKed
- duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

[sender FSM fragment: in “Wait for call 0 from above”, rdt_send(data) builds sndpkt = make_pkt(0, data, checksum) and sends it; in “Wait for ACK 0”, a corrupt reply or isACK(rcvpkt,1) triggers retransmission, while a not-corrupt isACK(rcvpkt,0) moves on. receiver FSM fragment: in “Wait for 0 from below”, a corrupt packet or one carrying seq 1 (a duplicate) causes the receiver to resend the previous ACK; a not-corrupt packet with the expected seq # is extracted, delivered, and ACKed with that seq #]
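A compact sketch of an rdt2.2-style receiver: ACK-only feedback carrying a 1-bit sequence number, so a corrupt or duplicate arrival is answered by re-ACKing the previous sequence number (which the sender treats like a NAK). The helper names and packet fields are assumptions mirroring the FSM labels, not the textbook’s code.

```python
# Sketch of an rdt2.2-style receiver: ACK-only, 1-bit sequence numbers.
# A corrupt or duplicate packet is answered by re-ACKing the last good seq #.

expected_seq = 0                      # receiver state: seq # expected next (0 or 1)

def corrupt(pkt) -> bool:
    return sum(pkt["data"]) & 0xFFFF != pkt["checksum"]     # toy checksum check

def rdt22_rcv(rcvpkt, deliver_data, udt_send):
    global expected_seq
    if corrupt(rcvpkt) or rcvpkt["seq"] != expected_seq:
        udt_send({"type": "ACK", "seq": 1 - expected_seq})  # re-ACK last good pkt
    else:
        deliver_data(rcvpkt["data"])                        # deliver new data up
        udt_send({"type": "ACK", "seq": expected_seq})
        expected_seq = 1 - expected_seq                     # flip 0 <-> 1

# Demo: an in-order packet, then a duplicate of the same packet (re-ACKed, not delivered).
pkt0 = {"seq": 0, "data": b"hi", "checksum": sum(b"hi") & 0xFFFF}
for p in (pkt0, pkt0):
    rdt22_rcv(p, deliver_data=lambda d: print("deliver:", d),
              udt_send=lambda a: print("send:", a))
```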
rdt3.0: channels with errors and loss

New assumption: the underlying channel can also lose packets (data or ACKs)
- checksum, seq. #s, ACKs, and retransmissions will help, but are not enough

Approach: sender waits a “reasonable” amount of time for an ACK
- retransmits if no ACK received in this time
- if the pkt (or ACK) is just delayed (not lost): the retransmission will be a duplicate, but use of seq. #’s already handles this; receiver must specify the seq # of the pkt being ACKed
- requires a countdown timer

rdt3.0 sender

[sender FSM with four states: “Wait for call 0 from above”, “Wait for ACK0”, “Wait for call 1 from above”, “Wait for ACK1”. rdt_send(data) builds sndpkt = make_pkt(0, data, checksum) (resp. seq 1), calls udt_send(sndpkt) and start_timer. While waiting for an ACK: a corrupt packet or an ACK for the wrong seq # is ignored; a timeout triggers udt_send(sndpkt) and start_timer (retransmit); a not-corrupt ACK for the current seq # triggers stop_timer and a move to the next sequence number. Stray ACKs arriving in the “Wait for call” states are ignored]

rdt3.0 in action

[two timing-diagram slides tracing rdt3.0 over a lossy channel: lost packets and lost ACKs are recovered by timeout and retransmission; a premature timeout produces a duplicate that the seq # lets the receiver discard]

Performance of rdt3.0

- rdt3.0 works, but performance stinks. Why? (see the utilization sketch below)

rdt3.0: stop-and-wait operation

[timing diagram: first packet bit transmitted at t = 0; last packet bit transmitted at t = L/R; the first packet bit arrives roughly RTT/2 later; the last packet bit arrives and the receiver sends an ACK; the ACK arrives and the sender sends the next packet at t = RTT + L/R]

Pipelined protocols

Pipelining: sender allows multiple, “in-flight”, yet-to-be-acknowledged pkts
- range of sequence numbers must be increased
- buffering at sender and/or receiver

Two generic forms of pipelined protocols: go-Back-N, selective repeat

Pipelining: increased utilization

[timing diagram: with three packets sent back-to-back, the ACK for the first still arrives at t = RTT + L/R, but three packets have been transmitted in that time, tripling utilization]

Go-Back-N

Sender:
- k-bit seq # in pkt header
- “window” of up to N consecutive unack’ed pkts allowed
- ACK(n): ACKs all pkts up to, and including, seq # n — “cumulative ACK”; may receive duplicate ACKs (see receiver)
- timer for each in-flight pkt
- timeout(n): retransmit pkt n and all higher seq # pkts in window

GBN: receiver

- ACK-only: always send ACK for correctly received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
- out-of-order pkt: discard (don’t buffer) — isn’t this bad? Re-ACK pkt with highest in-order seq #

GBN in action

[timing-diagram slide tracing GBN with a lost packet: subsequent out-of-order packets are discarded and re-ACKed, and the timeout retransmits the lost packet and everything after it in the window]

Selective Repeat

- receiver individually acknowledges all correctly received pkts; buffers pkts, as needed, for eventual in-order delivery to the upper layer
- sender only resends pkts for which an ACK was not received; sender timer for each unACKed pkt
- sender window: N consecutive seq #’s; again limits seq #s of sent, unACKed pkts

Selective repeat: sender, receiver windows

[figure: sender window (sent-and-ACKed, sent-not-yet-ACKed, usable, not-usable sequence numbers) and receiver window (out-of-order packets buffered, acceptable sequence numbers) shown side by side]

Selective repeat

sender:
- data from above: if the next available seq # is in the window, send pkt
- timeout(n): resend pkt n, restart timer
- ACK(n) in [sendbase, sendbase+N]: mark pkt n as received; if n is the smallest unACKed pkt, advance the window base to the next unACKed seq #

receiver:
- pkt n in [rcvbase, rcvbase+N-1]: send ACK(n); if in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt; if out-of-order: buffer
- pkt n in [rcvbase-N, rcvbase-1]: ACK(n)
- otherwise: ignore

Selective repeat in action

[timing-diagram slide tracing selective repeat with a lost packet: later packets are buffered and individually ACKed; only the lost packet is retransmitted on timeout]
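To see why stop-and-wait “performance stinks”, note that sender utilization is U = (L/R) / (RTT + L/R), the fraction of time the sender is actually transmitting. The sketch below plugs in illustrative numbers (a 1 Gbps link, 30 ms RTT, 8000-bit packets; these values are assumptions, not given in the slides) and shows how pipelining N packets multiplies utilization.

```python
# Utilization of stop-and-wait vs. pipelining (illustrative numbers assumed:
# 1 Gbps link, 30 ms RTT, 8000-bit packets -- not values from the slides).

R = 1e9          # link rate, bits/sec
L = 8000         # packet length, bits
RTT = 0.030      # round-trip time, seconds

t_transmit = L / R                                   # L/R = 8 microseconds
u_stop_and_wait = t_transmit / (RTT + t_transmit)    # sender-busy fraction
print(f"stop-and-wait utilization: {u_stop_and_wait:.5f}")   # ~0.00027

N = 3                                                # 3 packets in flight
u_pipelined = N * t_transmit / (RTT + t_transmit)    # utilization grows by a factor of N
print(f"pipelined (N=3) utilization: {u_pipelined:.5f}")
```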
Selective repeat: dilemma

Example: seq #’s: 0, 1, 2, 3; window size = 3
- receiver sees no difference in the two scenarios!
- incorrectly passes duplicate data as new in (a)
- Q: what relationship between seq # size and window size? (the window must be no larger than half the sequence-number space)

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

- point-to-point: one sender, one receiver
- reliable, in-order byte stream: no “message boundaries”
- pipelined: TCP congestion and flow control set window size
- send & receive buffers [figure: the application writes data into the TCP send buffer through the socket door; segments carry it to the TCP receive buffer, from which the application reads]
- full duplex data: bi-directional data flow in the same connection; MSS: maximum segment size
- connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
- flow controlled: sender will not overwhelm receiver

TCP segment structure

[32-bit-wide header, field by field:]
- source port #, dest port #
- sequence number (counting by bytes of data, not segments!)
- acknowledgement number
- header length, unused bits, flag bits U A P R S F (URG: urgent data, generally not used; ACK: ACK # valid; PSH: push data now, generally not used; RST, SYN, FIN: connection establishment and teardown commands)
- receive window (# bytes rcvr willing to accept)
- checksum (Internet checksum, as in UDP), urgent data pointer
- options (variable length)
- application data (variable length)

TCP seq. #’s and ACKs

- Seq. #’s: byte-stream “number” of the first byte in the segment’s data
- ACKs: seq # of the next byte expected from the other side; cumulative ACK
- Q: how does the receiver handle out-of-order segments? A: the TCP spec doesn’t say — up to the implementor
- [simple telnet scenario: the user on Host A types ‘C’; Host B ACKs receipt of ‘C’ and echoes back ‘C’; Host A ACKs receipt of the echoed ‘C’]

TCP Round Trip Time and Timeout

Q: how to set the TCP timeout value?
- longer than RTT, but RTT varies
- too short: premature timeout, unnecessary retransmissions
- too long: slow reaction to segment loss

Q: how to estimate RTT?
- SampleRTT: measured time from segment transmission until ACK receipt; ignore retransmissions
- SampleRTT will vary; want the estimated RTT “smoother”: average several recent measurements, not just the current SampleRTT (TCP uses an exponentially weighted moving average; a sketch appears after the ACK-generation table below)

Example RTT estimation

[plot: SampleRTT and EstimatedRTT versus time, gaia.cs.umass.edu to fantasia.eurecom.fr, RTTs ranging roughly from 100 to 350 milliseconds]

TCP reliable data transfer

- TCP creates an rdt service on top of IP’s unreliable service
- pipelined segments
- cumulative acks
- TCP uses a single retransmission timer
- retransmissions are triggered by: timeout events; duplicate acks
- initially consider a simplified TCP sender: ignore duplicate acks; ignore flow control, congestion control

TCP sender events

data rcvd from app:
- create segment with seq #; seq # is the byte-stream number of the first data byte in the segment
- start timer if not already running (think of the timer as for the oldest unacked segment); expiration interval: TimeOutInterval

timeout:
- retransmit the segment that caused the timeout
- restart timer

ack rcvd:
- if it acknowledges previously unacked segments: update what is known to be acked; start timer if there are outstanding segments

TCP sender (simplified)

    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

    loop (forever) {
      switch(event)

      event: data received from application above
        create TCP segment with sequence number NextSeqNum
        if (timer currently not running)
          start timer
        pass segment to IP
        NextSeqNum = NextSeqNum + length(data)

      event: timer timeout
        retransmit not-yet-acknowledged segment with smallest sequence number
        start timer

      event: ACK received, with ACK field value of y
        if (y > SendBase) {
          SendBase = y
          if (there are currently not-yet-acknowledged segments)
            start timer
        }
    } /* end of loop forever */

Comment: SendBase-1 is the last cumulatively ack’ed byte.
Example: SendBase-1 = 71; y = 73, so the rcvr wants 73+; y > SendBase, so new data is acked.

TCP: retransmission scenarios

[timing diagrams: (1) lost ACK scenario — Host A’s Seq=92 segment is received but the ACK is lost, the timeout fires, and the segment is retransmitted; (2) premature timeout — the timeout fires before the ACK arrives, Seq=92 is retransmitted, and SendBase moves from 100 to 120 as the cumulative ACKs arrive]

TCP retransmission scenarios (more)

[timing diagram: cumulative ACK scenario — the ACK for the first segment is lost, but the later cumulative ACK (for 120) arrives before the timeout, so no retransmission is needed]

TCP ACK generation [RFC 1122, RFC 2581]

Event at receiver → TCP receiver action:
- arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for the next segment; if no next segment, send ACK
- arrival of in-order segment with expected seq #; one other segment has an ACK pending → immediately send a single cumulative ACK, ACKing both in-order segments
- arrival of out-of-order segment with higher-than-expected seq. #; gap detected → immediately send a duplicate ACK, indicating the seq. # of the next expected byte
- arrival of a segment that partially or completely fills the gap → immediately send ACK, provided that the segment starts at the lower end of the gap
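The “smoother” average mentioned above is the standard exponentially weighted moving average used by TCP: EstimatedRTT = (1 − α)·EstimatedRTT + α·SampleRTT with α ≈ 0.125, plus a deviation estimate used to set TimeoutInterval = EstimatedRTT + 4·DevRTT. A small sketch with the usual textbook constants assumed and made-up sample values:

```python
# Sketch of TCP's RTT estimation and timeout computation (standard EWMA,
# alpha = 0.125, beta = 0.25, TimeoutInterval = EstimatedRTT + 4*DevRTT).

ALPHA, BETA = 0.125, 0.25

def update_rtt(estimated_rtt, dev_rtt, sample_rtt):
    estimated_rtt = (1 - ALPHA) * estimated_rtt + ALPHA * sample_rtt
    dev_rtt = (1 - BETA) * dev_rtt + BETA * abs(sample_rtt - estimated_rtt)
    timeout_interval = estimated_rtt + 4 * dev_rtt
    return estimated_rtt, dev_rtt, timeout_interval

# Feed in a few SampleRTT measurements (milliseconds; values are made up):
est, dev = 200.0, 25.0
for sample in [180, 260, 150, 210]:
    est, dev, timeout = update_rtt(est, dev, sample)
    print(f"SampleRTT={sample}  EstimatedRTT={est:.1f}  TimeoutInterval={timeout:.1f}")
```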
Fast Retransmit

- time-out period often relatively long: long delay before resending a lost packet
- detect lost segments via duplicate ACKs: the sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs
- if the sender receives 3 ACKs for the same data, it supposes that the segment after the ACKed data was lost
- fast retransmit: resend the segment before the timer expires

TCP Flow Control

- the receive side of a TCP connection has a receive buffer
- the app process may be slow at reading from the buffer
- flow control: sender won’t overflow the receiver’s buffer by transmitting too much, too fast
- speed-matching service: matching the send rate to the receiving app’s drain rate

TCP Flow control: how it works

(suppose the TCP receiver discards out-of-order segments)
- spare room in buffer = RcvWindow = RcvBuffer - [LastByteRcvd - LastByteRead]
- rcvr advertises spare room by including the value of RcvWindow in segments
- sender limits unACKed data to RcvWindow: guarantees the receive buffer doesn’t overflow (see the sketch below)

TCP Connection Management

Recall: TCP sender, receiver establish a “connection” before exchanging data segments
- initialize TCP variables: seq. #s; buffers, flow control info (e.g., RcvWindow)
- client: connection initiator — Socket clientSocket = new Socket("hostname","port number");
- server: contacted by client — Socket connectionSocket = welcomeSocket.accept();

Three way handshake:
- Step 1: client host sends TCP SYN segment to server; specifies initial seq #; no data
- Step 2: server host receives SYN, replies with SYNACK segment; server allocates buffers; specifies server initial seq. #
- Step 3: client receives SYNACK, replies with ACK segment, which may contain data

TCP Connection Management (cont.)

Closing a connection: client closes socket: clientSocket.close();
- Step 1: client end system sends TCP FIN control segment to server
- Step 2: server receives FIN, replies with ACK; closes connection, sends FIN
- Step 3: client receives FIN, replies with ACK; enters “timed wait” — will respond with ACK to received FINs
- Step 4: server receives ACK; connection closed
- Note: with a small modification, can handle simultaneous FINs
[timing diagram: client and server exchanging FIN/ACK segments, with the client’s timed wait before both sides reach “closed”]
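The receive-window bookkeeping in the flow-control slides above reduces to one line of arithmetic at the receiver and one comparison at the sender. A sketch using the variable names from the slide; the sender_may_send helper and the example numbers are my own illustration, not TCP’s actual code path.

```python
# Flow-control arithmetic from the slide:
#   RcvWindow = RcvBuffer - (LastByteRcvd - LastByteRead)
# The sender keeps the amount of unACKed data <= the advertised RcvWindow.

def rcv_window(rcv_buffer: int, last_byte_rcvd: int, last_byte_read: int) -> int:
    return rcv_buffer - (last_byte_rcvd - last_byte_read)   # spare room in buffer

def sender_may_send(last_byte_sent: int, last_byte_acked: int,
                    rcv_window_advertised: int, segment_len: int) -> bool:
    unacked = last_byte_sent - last_byte_acked
    return unacked + segment_len <= rcv_window_advertised   # don't overflow receiver

# Example: 64 KB buffer, 20000 bytes received, app has read 8000 bytes so far.
win = rcv_window(65536, 20000, 8000)                         # 53536 bytes spare
print(win, sender_may_send(last_byte_sent=30000, last_byte_acked=20000,
                           rcv_window_advertised=win, segment_len=1460))
```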
TCP Congestion Control

- end-end control (no network assistance)
- sender limits transmission: LastByteSent - LastByteAcked <= CongWin
- roughly, rate = CongWin / RTT bytes/sec
- CongWin is dynamic, a function of perceived network congestion

How does the sender perceive congestion?
- loss event = timeout or 3 duplicate acks
- TCP sender reduces rate (CongWin) after a loss event

Three mechanisms: AIMD; slow start; conservative behavior after timeout events

TCP AIMD

- multiplicative decrease: cut CongWin in half after a loss event
- additive increase: increase CongWin by 1 MSS every RTT in the absence of loss events: probing
[figure: congestion window of a long-lived TCP connection oscillating in a sawtooth between roughly 8, 16, and 24 Kbytes over time]

TCP Slow Start

- when the connection begins, CongWin = 1 MSS (example: MSS = 500 bytes & RTT = 200 msec gives an initial rate of only 20 kbps)
- available bandwidth may be >> MSS/RTT: desirable to quickly ramp up to a respectable rate
- when the connection begins, increase rate exponentially fast until the first loss event

TCP Slow Start (more)

- when the connection begins, increase rate exponentially until the first loss event: double CongWin every RTT; done by incrementing CongWin for every ACK received
- summary: initial rate is slow but ramps up exponentially fast
[timing diagram: Host A sends one segment, then two, then four, one RTT apart]

Refinement

- after 3 dup ACKs: CongWin is cut in half; the window then grows linearly
- but after a timeout event: CongWin is instead set to 1 MSS; the window then grows exponentially to a threshold, then grows linearly
- philosophy: 3 dup ACKs indicates the network is capable of delivering some segments; a timeout before 3 dup ACKs is “more alarming”

Refinement (more)

Q: when should the exponential increase switch to linear?
A: when CongWin gets to 1/2 of its value before the timeout.
Implementation: variable Threshold; at a loss event, Threshold is set to 1/2 of CongWin just before the loss event

Summary: TCP Congestion Control

- when CongWin is below Threshold, the sender is in the slow-start phase; the window grows exponentially
- when CongWin is above Threshold, the sender is in the congestion-avoidance phase; the window grows linearly
- when a triple duplicate ACK occurs, Threshold is set to CongWin/2 and CongWin is set to Threshold
- when a timeout occurs, Threshold is set to CongWin/2 and CongWin is set to 1 MSS
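The rules just summarized (slow start below Threshold, linear growth above it, halving via Threshold on triple duplicate ACK, reset to 1 MSS on timeout) can be captured in a short per-RTT simulation. This is only a sketch of how CongWin evolves under those rules, not a real TCP implementation; fast recovery is simplified and the loss pattern is invented for illustration.

```python
# Sketch: evolution of CongWin (in MSS) per RTT under the rules above.
# "3dup" halves the window via Threshold; "timeout" drops CongWin to 1 MSS.

congwin, threshold = 1, 16     # units: MSS

def on_rtt(loss=None):
    """Advance one RTT: grow CongWin, or react to a loss event."""
    global congwin, threshold
    if loss == "timeout":
        threshold, congwin = max(congwin // 2, 1), 1      # back to slow start
    elif loss == "3dup":
        threshold = max(congwin // 2, 1)
        congwin = threshold                               # fast recovery (simplified)
    elif congwin < threshold:
        congwin *= 2                                      # slow start: double per RTT
    else:
        congwin += 1                                      # congestion avoidance: +1 MSS per RTT

events = [None] * 6 + ["3dup"] + [None] * 4 + ["timeout"] + [None] * 5
for rtt, ev in enumerate(events):
    on_rtt(ev)
    print(f"RTT {rtt:2d}: CongWin={congwin:2d} MSS  Threshold={threshold:2d}")
```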
TCP sender congestion control

- Event: ACK receipt for previously unacked data | State: Slow Start (SS) | Action: CongWin = CongWin + MSS; if (CongWin > Threshold) set state to “Congestion Avoidance” | Commentary: resulting in a doubling of CongWin every RTT
- Event: ACK receipt for previously unacked data | State: Congestion Avoidance (CA) | Action: CongWin = CongWin + MSS * (MSS/CongWin) | Commentary: additive increase, resulting in an increase of CongWin by 1 MSS every RTT
- Event: loss event detected by triple duplicate ACK | State: SS or CA | Action: Threshold = CongWin/2; CongWin = Threshold; set state to “Congestion Avoidance” | Commentary: fast recovery, implementing multiplicative decrease; CongWin will not drop below 1 MSS
- Event: timeout | State: SS or CA | Action: Threshold = CongWin/2; CongWin = 1 MSS; set state to “Slow Start” | Commentary: enter slow start
- Event: duplicate ACK | State: SS or CA | Action: increment duplicate ACK count for the segment being acked | Commentary: CongWin and Threshold not changed

Interplay between routing and forwarding

- the routing algorithm fills the local forwarding table, which maps header values to output links
- example forwarding table: header value 0100 → link 3; 0101 → link 2; 0111 → link 2; 1001 → link 1
- the value in an arriving packet’s header (here 0111) is looked up in the table to choose the output link (here link 2)

Graph abstraction

Graph: G = (N, E)
- N = set of routers = { u, v, w, x, y, z }
- E = set of links = { (u,v), (u,x), (u,w), (v,x), (v,w), (x,w), (x,y), (w,y), (w,z), (y,z) }
- [figure: the graph with link costs c(u,v)=2, c(u,x)=1, c(u,w)=5, c(v,x)=2, c(v,w)=3, c(x,w)=3, c(x,y)=1, c(w,y)=1, c(w,z)=5, c(y,z)=2]
- Remark: the graph abstraction is useful in other network contexts. Example: P2P, where N is the set of peers and E is the set of TCP connections

Graph abstraction: costs

- c(x,x') = cost of link (x,x'), e.g., c(w,z) = 5
- cost could always be 1, or inversely related to bandwidth, or inversely related to congestion
- cost of path (x1, x2, x3, ..., xp) = c(x1,x2) + c(x2,x3) + ... + c(xp-1,xp)
- Question: what’s the least-cost path between u and z?
- Routing algorithm: algorithm that finds the least-cost path

Routing Algorithm classification

Global or decentralized information?
- Global: all routers have complete topology and link cost info — “link state” algorithms
- Decentralized: a router knows its physically-connected neighbors and the link costs to them; iterative process of computation, exchange of info with neighbors — “distance vector” algorithms

Static or dynamic?
- Static: routes change slowly over time
- Dynamic: routes change more quickly; periodic updates in response to link cost changes
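The edge costs of the example graph can be written down directly, and the path-cost formula then becomes a one-line sum. A small sketch; the edge costs below are the u..z values reconstructed from the figure, and the two example paths are my own illustration.

```python
# Path cost on the example graph: cost(x1..xp) = c(x1,x2) + ... + c(xp-1,xp).
# Edge costs reconstructed from the u..z figure used in the Dijkstra/B-F examples.

cost = {("u","v"): 2, ("u","x"): 1, ("u","w"): 5, ("v","x"): 2, ("v","w"): 3,
        ("x","w"): 3, ("x","y"): 1, ("w","y"): 1, ("w","z"): 5, ("y","z"): 2}

def c(a, b):
    return cost.get((a, b), cost.get((b, a), float("inf")))   # links are bidirectional

def path_cost(path):
    return sum(c(a, b) for a, b in zip(path, path[1:]))

print(path_cost(["u", "x", "y", "z"]))   # 1 + 1 + 2 = 4  (the least-cost u -> z path)
print(path_cost(["u", "v", "w", "z"]))   # 2 + 3 + 5 = 10
```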
A Link-State Routing Algorithm

Dijkstra’s algorithm:
- net topology and link costs known to all nodes; accomplished via “link state broadcast”; all nodes have the same info
- computes least-cost paths from one node (“source”) to all other nodes; gives the forwarding table for that node
- iterative: after k iterations, know the least-cost path to k destinations

Notation:
- c(x,y): link cost from node x to y; = ∞ if not direct neighbors
- D(v): current value of the cost of the path from source to dest. v
- p(v): predecessor node along the path from source to v
- N': set of nodes whose least-cost path is definitively known

Dijkstra’s Algorithm

    1  Initialization:
    2    N' = {u}
    3    for all nodes v
    4      if v adjacent to u
    5        then D(v) = c(u,v)
    6        else D(v) = ∞
    7
    8  Loop
    9    find w not in N' such that D(w) is a minimum
    10   add w to N'
    11   update D(v) for all v adjacent to w and not in N':
    12     D(v) = min( D(v), D(w) + c(w,v) )
    13   /* new cost to v is either old cost to v or known
    14      shortest path cost to w plus cost from w to v */
    15 until all nodes in N'

Dijkstra’s algorithm: example

    Step  N'       D(v),p(v)  D(w),p(w)  D(x),p(x)  D(y),p(y)  D(z),p(z)
    0     u        2,u        5,u        1,u        ∞          ∞
    1     ux       2,u        4,x                   2,x        ∞
    2     uxy      2,u        3,y                              4,y
    3     uxyv                3,y                              4,y
    4     uxyvw                                                4,y
    5     uxyvwz

Dijkstra’s algorithm, discussion

Algorithm complexity: n nodes
- each iteration: need to check all nodes w not in N'
- n(n+1)/2 comparisons: O(n^2)
- more efficient implementations possible: O(n log n)

Oscillations possible:
- e.g., link cost = amount of carried traffic
[figure: four-node example in which routes to a destination oscillate each time link costs are recomputed from the carried traffic]

Distance Vector Algorithm (1)

Bellman-Ford Equation (dynamic programming)

Define dx(y) := cost of the least-cost path from x to y. Then

    dx(y) = min { c(x,v) + dv(y) }

where the min is taken over all neighbors v of x.

Bellman-Ford example (2)

Clearly, dv(z) = 5, dx(z) = 3, dw(z) = 3. The B-F equation says:

    du(z) = min { c(u,v) + dv(z), c(u,x) + dx(z), c(u,w) + dw(z) }
          = min { 2 + 5, 1 + 3, 5 + 3 } = 4

The node that achieves the minimum is the next hop on the shortest path ➜ forwarding table.

Distance Vector Algorithm (3)

- Dx(y) = estimate of least cost from x to y
- distance vector: Dx = [Dx(y): y ∈ N]
- node x knows the cost to each neighbor v: c(x,v)
- node x maintains Dx = [Dx(y): y ∈ N]
- node x also maintains its neighbors’ distance vectors: for each neighbor v, x maintains Dv = [Dv(y): y ∈ N]

Distance vector algorithm (4)

Basic idea:
- each node periodically sends its own distance vector estimate to its neighbors
- when a node x receives a new DV estimate from a neighbor, it updates its own DV using the B-F equation:

      Dx(y) ← min over v of { c(x,v) + Dv(y) }   for each node y ∈ N

- under minor, natural conditions, the estimate Dx(y) converges to the actual least cost dx(y)

Distance Vector Algorithm (5)

Iterative, asynchronous: each local iteration is caused by:
- a local link cost change
- a DV update message from a neighbor

Distributed:
- each node notifies neighbors only when its DV changes; neighbors then notify their neighbors if necessary

Each node: wait for (a change in local link cost or a msg from a neighbor); recompute estimates; if the DV to any dest has changed, notify neighbors

Example (three-node network with c(x,y) = 2, c(y,z) = 1, c(x,z) = 7):
- Dx(y) = min{ c(x,y) + Dy(y), c(x,z) + Dz(y) } = min{ 2+0, 7+1 } = 2
- Dx(z) = min{ c(x,y) + Dy(z), c(x,z) + Dz(z) } = min{ 2+1, 7+0 } = 3
[figure: the tables at nodes x, y, and z converging over time to costs 0/2/3 from x, 2/0/1 from y, and 3/1/0 from z]
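The Dijkstra pseudocode above translates almost line for line into Python. A sketch run on the same u..z example graph (edge costs reconstructed from the figure); the final values match the example table, e.g. D(z) = 4 with predecessor y, although ties may be broken in a different order than in the table.

```python
# Dijkstra's algorithm, following the pseudocode above, on the u..z example graph.

INF = float("inf")
graph = {  # neighbor -> link cost, reconstructed from the example figure
    "u": {"v": 2, "x": 1, "w": 5},
    "v": {"u": 2, "x": 2, "w": 3},
    "w": {"u": 5, "v": 3, "x": 3, "y": 1, "z": 5},
    "x": {"u": 1, "v": 2, "w": 3, "y": 1},
    "y": {"x": 1, "w": 1, "z": 2},
    "z": {"w": 5, "y": 2},
}

def dijkstra(source):
    N_prime = {source}                                  # nodes with known least cost
    D = {v: graph[source].get(v, INF) for v in graph}   # initialization (lines 1-6)
    D[source] = 0
    p = {v: source for v in graph if v in graph[source]}
    while len(N_prime) < len(graph):                    # loop (lines 8-15)
        w = min((v for v in graph if v not in N_prime), key=lambda v: D[v])
        N_prime.add(w)
        for v, cost_wv in graph[w].items():
            if v not in N_prime and D[w] + cost_wv < D[v]:
                D[v], p[v] = D[w] + cost_wv, w          # relax: new shorter path via w
    return D, p

D, p = dijkstra("u")
print(D)   # {'u': 0, 'v': 2, 'w': 3, 'x': 1, 'y': 2, 'z': 4}
print(p)   # predecessors give the paths, e.g. z <- y <- x <- u
```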
Distance Vector: link cost changes

Link cost changes:
- node detects a local link cost change
- updates routing info, recalculates its distance vector
- if the DV changes, notify neighbors

“good news travels fast”
[example: three-node network x–y–z with c(y,z) = 1, c(x,z) = 50, and c(x,y) dropping from 4 to 1]
- at time t0, y detects the link-cost change, updates its DV, and informs its neighbors
- at time t1, z receives the update from y and updates its table; it computes a new least cost to x and sends its neighbors its DV
- at time t2, y receives z’s update and updates its distance table; y’s least costs do not change, and hence y does not send any message to z

Distance Vector: link cost changes (cont.)

- good news travels fast; bad news travels slow — the “count to infinity” problem!
[example: the same network with c(x,y) rising from 4 to 60]
- 44 iterations before the algorithm stabilizes: see text
- Poisoned reverse: if Z routes through Y to get to X, Z tells Y that its (Z’s) distance to X is infinite (so Y won’t route to X via Z). Will this completely solve the count-to-infinity problem?

Comparison of LS and DV algorithms

Message complexity:
- LS: with n nodes, E links, O(nE) msgs sent
- DV: exchange between neighbors only; convergence time varies

Speed of convergence:
- LS: O(n^2) algorithm requires O(nE) msgs; may have oscillations
- DV: convergence time varies; may be routing loops; count-to-infinity problem

Robustness: what happens if a router malfunctions?
- LS: a node can advertise an incorrect link cost; each node computes only its own table
- DV: a DV node can advertise an incorrect path cost; each node’s table is used by others, so errors propagate through the network

Multiple Access Links and Protocols

Two types of “links”:
- point-to-point: PPP for dial-up access; point-to-point link between Ethernet switch and host
- broadcast (shared wire or medium): traditional Ethernet; upstream HFC; 802.11 wireless LAN

Multiple Access protocols

- single shared broadcast channel
- two or more simultaneous transmissions by nodes: interference; collision if a node receives two or more signals at the same time

multiple access protocol:
- distributed algorithm that determines how nodes share the channel, i.e., determines when a node can transmit
- communication about channel sharing must use the channel itself! no out-of-band channel for coordination
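The DV update step follows directly from the Bellman-Ford equation. The sketch below uses the link costs from the three-node x–y–z example above (c(x,y)=2, c(y,z)=1, c(x,z)=7) and reproduces Dx(y)=2 and Dx(z)=3; the neighbor tables assume y and z have already converged, and poisoned reverse is omitted for brevity.

```python
# Distance-vector update at node x, using the Bellman-Ford equation:
#   Dx(y) = min over neighbors v of { c(x,v) + Dv(y) }
# Three-node example with c(x,y)=2, c(y,z)=1, c(x,z)=7, as in the slides above.

c = {"y": 2, "z": 7}                         # x's cost to each neighbor

# Neighbors' most recently received distance vectors (assumed already converged):
D_nbr = {
    "y": {"x": 2, "y": 0, "z": 1},
    "z": {"x": 3, "y": 1, "z": 0},           # z has already learned the route to x via y
}

def dv_update(dest_nodes):
    """Recompute x's distance vector from its neighbors' DVs."""
    Dx = {}
    for y in dest_nodes:
        Dx[y] = 0 if y == "x" else min(c[v] + D_nbr[v][y] for v in c)
    return Dx

print(dv_update(["x", "y", "z"]))   # {'x': 0, 'y': 2, 'z': 3}
```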
Ideal Multiple Access Protocol

Broadcast channel of rate R bps:
1. when one node wants to transmit, it can send at rate R
2. when M nodes want to transmit, each can send at an average rate of R/M
3. fully decentralized: no special node to coordinate transmissions; no synchronization of clocks, slots
4. simple

MAC Protocols: a taxonomy

Three broad classes:
- Channel Partitioning: divide the channel into smaller “pieces” (time slots, frequency, code); allocate a piece to a node for exclusive use
- Random Access: channel not divided, allow collisions; “recover” from collisions
- “Taking turns”: nodes take turns, but nodes with more to send can take longer turns

Channel Partitioning MAC protocols: TDMA

TDMA: time division multiple access
- access to the channel in “rounds”
- each station gets a fixed-length slot (length = pkt trans time) in each round
- unused slots go idle
- example: 6-station LAN; 1, 3, 4 have pkts, slots 2, 5, 6 idle

Channel Partitioning MAC protocols: FDMA

FDMA: frequency division multiple access
- channel spectrum divided into frequency bands
- each station assigned a fixed frequency band
- unused transmission time in frequency bands goes idle
- example: 6-station LAN; 1, 3, 4 have pkts, frequency bands 2, 5, 6 idle

Random Access Protocols

- when a node has a packet to send, it transmits at the full channel data rate R; no a priori coordination among nodes
- two or more transmitting nodes ➜ “collision”
- a random access MAC protocol specifies: how to detect collisions; how to recover from collisions (e.g., via delayed retransmissions)
- examples of random access MAC protocols: slotted ALOHA, ALOHA, CSMA, CSMA/CD, CSMA/CA

Slotted ALOHA

Assumptions:
- all frames the same size
- time is divided into equal-size slots, the time to transmit 1 frame
- nodes start to transmit frames only at the beginning of slots
- nodes are synchronized
- if 2 or more nodes transmit in a slot, all nodes detect the collision

Operation:
- when a node obtains a fresh frame, it transmits in the next slot
- no collision: the node can send a new frame in the next slot
- if collision: the node retransmits the frame in each subsequent slot with prob. p until success

Slotted ALOHA (pros and cons)

Pros:
- a single active node can continuously transmit at the full rate of the channel
- highly decentralized: only slots in nodes need to be in sync
- simple

Cons:
- collisions, wasting slots
- idle slots
- nodes may be able to detect a collision in less than the time to transmit a packet
- clock synchronization

Slotted Aloha efficiency

Efficiency is the long-run fraction of successful slots when there are many nodes, each with many frames to send.
- suppose N nodes with many frames to send, each transmits in a slot with probability p
- prob that node 1 has a success in a slot = p(1-p)^(N-1)
- prob that any node has a success = Np(1-p)^(N-1)
- for max efficiency with N nodes, find p* that maximizes Np(1-p)^(N-1)
- for many nodes, take the limit of Np*(1-p*)^(N-1) as N goes to infinity: gives 1/e = .37
- at best: channel used for useful transmissions only 37% of the time! (a short numerical check follows after the CSMA slide below)

Pure (unslotted) ALOHA

- unslotted Aloha: simpler, no synchronization
- when a frame first arrives, transmit immediately
- collision probability increases: a frame sent at t0 collides with other frames sent in [t0-1, t0+1]

CSMA (Carrier Sense Multiple Access)

CSMA: listen before transmit:
- if the channel is sensed idle: transmit the entire frame
- if the channel is sensed busy: defer transmission
- human analogy: don’t interrupt others!
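The slotted ALOHA efficiency claim is easy to check numerically: the success probability Np(1-p)^(N-1) is maximized at p* = 1/N, and as N grows the maximum approaches 1/e ≈ 0.37. A short sketch:

```python
# Slotted ALOHA efficiency: the probability that exactly one of N nodes transmits
# in a slot is N*p*(1-p)**(N-1); it is maximized at p = 1/N and tends to 1/e ~ 0.37.
import math

def success_prob(N: int, p: float) -> float:
    return N * p * (1 - p) ** (N - 1)

for N in (5, 10, 50, 1000):
    p_star = 1 / N                               # maximizing transmission probability
    print(f"N={N:4d}  max efficiency = {success_prob(N, p_star):.3f}")
print(f"limit 1/e = {1 / math.e:.3f}")           # ~0.368: channel useful ~37% of slots
```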
CSMA collisions

- collisions can still occur: propagation delay means two nodes may not hear each other’s transmission
- collision: the entire packet transmission time is wasted
- note: the role of distance & propagation delay in determining collision probability
[figure: spatial layout of nodes showing two distant nodes starting transmissions that overlap before either hears the other]

CSMA/CD (Collision Detection)

- CSMA/CD: carrier sensing and deferral as in CSMA; collisions detected within a short time; colliding transmissions aborted, reducing channel wastage
- collision detection: easy in wired LANs: measure signal strengths, compare transmitted and received signals; difficult in wireless LANs: the receiver is shut off while transmitting
- human analogy: the polite conversationalist

CSMA/CD collision detection

[figure: space-time diagram of two nodes detecting the overlap of their transmissions and aborting]

“Taking Turns” MAC protocols

- channel partitioning MAC protocols: share the channel efficiently and fairly at high load; inefficient at low load: delay in channel access, 1/N bandwidth allocated even if only 1 active node!
- random access MAC protocols: efficient at low load: a single node can fully utilize the channel; high load: collision overhead
- “taking turns” protocols look for the best of both worlds!

“Taking Turns” MAC protocols (cont.)

Polling:
- master node “invites” slave nodes to transmit in turn
- concerns: polling overhead, latency, single point of failure (master)

Token passing:
- control token passed from one node to the next sequentially (token message)
- concerns: token overhead, latency, single point of failure (token)

Ethernet uses CSMA/CD

- no slots
- an adapter doesn’t transmit if it senses that some other adapter is transmitting, that is, carrier sense
- a transmitting adapter aborts when it senses that another adapter is transmitting, that is, collision detection
- before attempting a retransmission, the adapter waits a random time, that is, random access

Ethernet CSMA/CD algorithm

1. Adapter receives a datagram from the network layer & creates a frame
2. If the adapter senses the channel idle, it starts to transmit the frame. If it senses the channel busy, it waits until the channel is idle and then transmits
3. If the adapter transmits the entire frame without detecting another transmission, the adapter is done with the frame!
4. If the adapter detects another transmission while transmitting, it aborts and sends a jam signal
5. After aborting, the adapter enters exponential backoff: after the mth collision, the adapter chooses a K at random from {0, 1, 2, ..., 2^m - 1}. The adapter waits K·512 bit times and returns to Step 2

Ethernet’s CSMA/CD (more)

- Jam signal: make sure all other transmitters are aware of the collision; 48 bits
- Bit time: 0.1 microsec for 10 Mbps Ethernet; for K = 1023, the wait time is about 50 msec
- See/interact with the Java applet on the AWL Web site: highly recommended!
- Exponential backoff: goal: adapt retransmission attempts to the estimated current load; heavy load: the random wait will be longer
  - first collision: choose K from {0,1}; delay is K · 512 bit transmission times
  - after the second collision: choose K from {0,1,2,3} ...
  - after ten collisions, choose K from {0,1,2,3,4,...,1023}
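The exponential backoff rule in Step 5 is a one-liner. The sketch below computes the wait in bit times and in seconds, using the 0.1 µs bit time of 10 Mbps Ethernet quoted above; the random choice of K is exactly the {0, ..., 2^m - 1} range described, capped at 1023 as for ten or more collisions.

```python
# Ethernet CSMA/CD exponential backoff (Step 5 above): after the m-th collision,
# choose K uniformly from {0, 1, ..., 2**m - 1} (capped at 1023) and wait
# K * 512 bit times. The bit time for 10 Mbps Ethernet is 0.1 microseconds.
import random

BIT_TIME = 0.1e-6          # seconds per bit at 10 Mbps

def backoff_after_collision(m: int) -> float:
    k_max = min(2 ** m - 1, 1023)          # after 10 collisions the range stops growing
    K = random.randint(0, k_max)
    wait_seconds = K * 512 * BIT_TIME
    print(f"collision {m}: K={K:4d}  wait = {wait_seconds * 1e6:.1f} microseconds")
    return wait_seconds

for m in (1, 2, 3, 10):
    backoff_after_collision(m)
# Worst case: K = 1023 -> 1023 * 512 * 0.1 us ~ 52 ms (the "about 50 msec" above).
```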