TCOM 509: UDP, TCP/IP - Internet Protocols
* Obtained permission to use Raj Jain's technical material

IP Routing

An example routing table:

Destination     Next-Hop         Flags   Network Interface
127.0.0.1       127.0.0.1        H       lo0
Default         150.100.15.54    G       emd0
150.100.15.0    150.100.15.11            emd0

Flags:
- H=1: destination is a complete host address
- H=0: destination is a network address
- G=1: next hop is a router
- G=0: next hop is a directly connected destination

Search order of the routing table:
1. Complete match of the destination IP address
2. Match of the network address (including the subnet ID)
3. Use the default router
4. If all previous steps fail to find a suitable entry, send an ICMP "host unreachable" error

ROUTING SUMMARY

Routing is the process of discovering, selecting and following paths from the transmitting host to the receiving host in a network. There are two categories of routing algorithms:
- Source Routing: the transmitting host inserts a list of routers that describe a path through the network.
- Hop-by-Hop Routing: the transmitting host knows how to get to the first router. That router then uses its routing table to select the next best hop (router), which selects the next best router, and so on.

Classification:
Routing
  Source Routing: Strict, Loose
  Hop-by-Hop Routing: Static, Dynamic, Default
    Distance Vector: 1. RIP  2. IGRP
    Link State: 1. OSPF  2. IS-IS
    Path Vector: 1. EGP  2. BGP

- Distance Vector: the router sends a list of networks, how far away they are, and the next-hop direction.
- Link State: the router has a complete topology map of the network.
- Path Vector: the router sends a complete path to get to a destination.

IP Routing Strategies

- Static Routing: pre-determined routes set up by an administrator
- Dynamic Routing:
  - Interior Routing Protocols: RIP, OSPF
  - Exterior Routing Protocols: BGP

How Does IP Routing Work?

Basic procedure:
1. Search for a matching host address (/32)
2. Search for a matching network address (/x, where 0 < x < 32)
3. Search for a default entry (0/0)
4. If all previous steps fail to find a suitable entry, send an ICMP "host unreachable" error

IP packets are routed via a "best-match" or "longest-match" principle.

Processing an IP packet
[Figure: processing an IP packet]

IP Source Routing

- IP routing has no concept of the source determining the route.
- What if the source wanted to specify the packet's path?
- The source route option was added to the IP protocol in order to assist in route debugging.
- Nowadays, it seems to be mainly used by large ISPs, to make sure that their peers aren't inappropriately dumping traffic onto their backbone links.
- A packet is given a list of desired hops that should be taken on the way to the final destination.

IP Source Routing via IP Options
CODE = 131

Source Routing example
[Figure: source routing example]

Types of Routes

- Static: all packets are forwarded to predetermined destinations defined by an administrator.
- Dynamic: packets are forwarded along dynamically calculated routes determined by a routing protocol.

Static Routing

Benefits:
- Good for small networks
- Can help create a secure network
- Efficiently uses router resources

Drawbacks:
- Does not handle network failures well
- Does not scale well

Static Routing Example

Networks: Network 10 (Router A), Network 172.16 (Router B), Network 192.168.5 (Router C), Network 192.168.6 (Router D).

Router A                      Router B
Destination   Next Hop        Destination   Next Hop
10.0.0.0      Direct          10            Router A
172.16        Router B        172.16        Direct
192.168.5     Router C        192.168.5     Router C
192.168.6     Router C        192.168.6     Router C

Router C                      Router D
Destination   Next Hop        Destination   Next Hop
10            Router A        192.168.6     Direct
172.16        Router B        Default       Router C
192.168.5     Direct
192.168.6     Router D

Static Routing with Link Failure

Router A                      Router B
Destination   Next Hop        Destination   Next Hop
10            Direct          10            Router A
172.16        Router B        172.16        Direct
192.168.5     Router C        192.168.5     Router C
192.168.6     Router C        192.168.6     Router C

Router C                      Router D
Destination   Next Hop        Destination   Next Hop
10            Unreachable     192.168.6     Direct
172.16        Router B        Default       Router C
192.168.5     Direct
192.168.6     Router D

Dynamic Routing

                   Communicate what?    Between whom?
Distance-Vector    Routing tables       Neighbors
Link-State         Interface status     All routers

Dynamic IP Routing Protocols

- RIP (Distance Vector)
- OSPF (Link State)
- IS-IS (Link State)
- BGP (Path Vector)

Distance Vector vs. Link State
[Figure: distance vector vs. link state]

Concept of Administrative Distance

Administrative distances of routing protocols:

Route source                                      Administrative distance
Connected                                         0
Static to Interface or Static to Next Hop         1
E-IGRP (Cisco only)                               90
OSPF                                              110
IS-IS                                             115
RIP v1 and v2                                     120

- Measures the trustworthiness of the source of a route
- Handles preferences when multiple sources of routing information exist in a router
- The protocol with the lowest administrative distance wins
- Only one IGP route is installed in the routing table

Distance-Vector and Link State Protocol

Protocol   Category          Metric                  Algorithm
RIP v1     Distance Vector   Hop count               Bellman-Ford
RIP v2     Distance Vector   Hop count               Bellman-Ford
OSPF       Link State        Bandwidth-based cost    Shortest Path First
IGRP       Distance Vector   Composite               Bellman-Ford

What is RIP (Routing Information Protocol)?

- RIP is an Interior Gateway Protocol (IGP)
- Used within an Autonomous System (AS): a collection of routers under the same administrative authority
- Two versions:
  - RIP v1 (RFC 1058)
  - RIP v2 (RFC 2453)

Distance Vector Routing Protocol

RIP v1 - Characteristics:
- Directly connected subnets are known
- Routing updates are broadcast to neighbors
- Routers listen to routing updates
- Metrics are used
- Routing information consists of subnet and metric
- Periodic updates (every 30 seconds)
- A route is learned via a neighbor
- A failed route has a metric of infinity

RIP Uses UDP

RIP is a UDP-based protocol. Each router that uses RIP has a routing process that sends and receives datagrams on UDP port number 520, the RIP-1/RIP-2 port. All communications intended for another router's RIP process are sent to the RIP port. All routing update messages are sent from the RIP port.
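To make the RIP v1 characteristics above concrete (routes learned from a neighbor, one metric per subnet, an infinite metric for failed routes), here is a minimal Python sketch of how a router might merge one received update into its table. The function name, the table layout, and the use of 16 as "infinity" (the value RFC 1058 uses for unreachable) are illustrative assumptions, not part of the slides; timers, split horizon, and UDP transport are deliberately left out.

```python
RIP_INFINITY = 16  # assumption: RFC 1058 treats a metric of 16 as unreachable


def process_rip_response(table, neighbor, advertised):
    """Merge routes advertised by `neighbor` into `table` (hypothetical helper).

    table:      dict mapping subnet -> (metric, next_hop)
    advertised: dict mapping subnet -> metric as reported by the neighbor
    Returns True if the table changed.
    """
    changed = False
    for subnet, metric in advertised.items():
        # One extra hop to go through the neighbor, capped at infinity.
        new_metric = min(metric + 1, RIP_INFINITY)
        current = table.get(subnet)
        if current is None:
            # New destination: install it only if it is reachable.
            if new_metric < RIP_INFINITY:
                table[subnet] = (new_metric, neighbor)
                changed = True
            continue
        cur_metric, cur_hop = current
        # Always believe the neighbor we already route through (even bad news);
        # otherwise switch only when the new path is strictly shorter.
        if cur_hop == neighbor or new_metric < cur_metric:
            if (new_metric, neighbor) != current:
                table[subnet] = (new_metric, neighbor)
                changed = True
    return changed


# Usage: a router with two directly connected subnets hears from neighbor "A".
table = {"162.11.7.0": (1, "direct"), "162.11.8.0": (1, "direct")}
process_rip_response(table, "A", {"162.11.5.0": 1, "162.11.8.0": 1, "162.11.9.0": 1})
```

After the call, the two subnets behind A are installed with metric 2, while the directly connected 162.11.8.0 keeps its metric of 1 because A's path is not shorter.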
RIP Characteristics

- Distance-vector routing protocol
- Updates contain routes (vectors) and the cost (distance) to reach them. Operation consists of the following steps:
  - Each node calculates the distances between itself and all other nodes within the AS and stores this information as a table.
  - Each node sends its table to all neighboring nodes.
  - When a node receives distance tables from its neighbors, it calculates the shortest routes to all other nodes and updates its own table to reflect any changes.
- Does not scale well for large networks, as every router has to add a RIP route for every newly added network
- Hop count is used as the metric for path selection, based on the Bellman-Ford distance-vector routing algorithm
- Maximum allowable hop count is 15
- Routing updates are broadcast every 30 seconds

RIP Message Types

Two message types:
- Request message: asks neighbors to send routes
- Response message: carries route updates
  - Advertises up to 25 routes per update
  - The router decides how to handle the routes in an update: add, modify, or delete

RIP Routing Metrics

- Counts the number of hops between source and destination
  - The number of hops is the number of router hops
  - Hop count equals the RIP metric
- RIP cannot take measured delay, reliability, load, or link bandwidth into account
- With multiple paths to the same prefix, the one with the fewest hops is selected
  - This may not be the optimum path

RIP in Action (1): Routers A and C are down

[Figure: Router A (Tr0 on 162.11.5.0, serial links on 162.11.8.0 and 162.11.9.0), Router B (E0 on 162.11.7.0, s0) and Router C (E0 on 162.11.10.0, s0).]

Routing table B:
Network       Interface   Metric
162.11.7.0    E0          1
162.11.8.0    s0          1

RIP in Action (2): Router C is down, Router A is switched on

Routing table B:
Network       Interface   Metric
162.11.5.0    s0          2
162.11.7.0    E0          1
162.11.8.0    s0          1
162.11.9.0    s0          2

RIP in Action (3): Router C is switched on

Routing table B:
Network       Interface   Metric
162.11.5.0    s0          2
162.11.7.0    E0          1
162.11.8.0    s0          1
162.11.9.0    s0          2
162.11.10.0   s0          3

RIP Timers

RIP uses numerous timers to regulate its performance. These include a routing-update timer, a route-timeout timer, and a route-flush timer.
- Routing-update timer: clocks the interval between periodic routing updates. Generally, it is set to 30 seconds, with a small random amount of time added whenever the timer is reset. This helps prevent congestion, which could result from all routers simultaneously attempting to update their neighbors.
- Route-timeout timer: each routing table entry has a route-timeout timer associated with it. When the route-timeout timer expires, the route is marked invalid but is retained in the table until the route-flush timer expires. RFC 2453 sets this timer to 180 seconds: if 180 seconds elapse from the last time the timeout was initialized, the route is considered to have expired and the deletion process begins for that route.
- Route-flush timer: started when a route is marked invalid; when it expires, the route is removed from the table. RFC 2453 sets this (garbage-collection) timer to 120 seconds. Implementation defaults vary.

Bellman-Ford's Distance Vector Algorithm - Example
http://www.laynetworks.com/Simulation%20of%20Bellman%20Algorithm.htm

Disadvantages of the Bellman-Ford Algorithm

- Does not scale well
- Changes in network topology are not reflected quickly, since updates are spread node-by-node.
- Counting to infinity: if link or node failures render a node unreachable from some set of other nodes, those nodes may spend forever gradually increasing their estimates of the distance to it, and in the meantime there may be routing loops.

Count To Infinity Problem - URL Link

Improving Convergence

Split Horizon
- For interface X, don't advertise routes out X that you learned via X.
- Prevents forwarding loops only in the case of two adjacent routers.
- Joke analogy: if you tell me a joke and you get it, I don't need to tell it back to you.

Hold Down Timers
- Refuse to accept any information about a route for a period of time (60 secs) after the route is declared unreachable.
- Can increase convergence time.

Triggered Updates
- When a change occurs, send an update immediately (don't wait for the next update interval).
- A change can be defined as an observed increase in hop count over time (a 1.6-2.0 increase in the originally stored hop count).
- An attempt to speed up convergence.

Hold Down Timers and Triggered Updates
- Can be used together to be more effective.

Split Horizon with Poison Reverse
- a.k.a. "Infinite Split Horizon"
- For interface X, DO advertise routes out X that you learned via X, but with a metric of INFINITY.
- Advantage: eliminates two-router loops.
- Disadvantage: increases the size of routing updates.

None of these mechanisms can completely prevent routing loops, and counting to infinity does not go away.

RIPv2 (RFC 2453) - Solves Some of the RIPv1 Shortcomings

RIPv2 is classless and uses UDP port 520, as does RIPv1 (which is classful). It is still a distance-vector protocol and still uses hop count as the metric, with a maximum hop count of 15. The ability to multicast saves other devices on the network from wasting time opening broadcast packets.

RIPv2 - Subnet Mask

Classless routing protocols carry the subnet mask. This allows the all-0s and all-1s subnets to be used, eliminating the ambiguity of an address such as 172.16.255.255, which can be read either as the "all subnets" broadcast of 172.16.0.0 or as the broadcast on the all-1s subnet. Without a mask, which is which? If the subnet mask is sent, then 172.16.255.255 /16 and 172.16.255.255 /24 can be differentiated.

RIPv2 - Route Tag

Each RIPv2 entry includes a Route Tag field, where additional information about a route can be stored. It provides a method for distinguishing between internal routes (learned by RIP) and external routes (learned from other protocols).

RIPv2 - Next Hop

In RIPv2, each RIP entry includes a field where an explicit IP address can be entered as the next-hop router for datagrams intended for the network in that entry. Specifying a value of 0.0.0.0 in this field indicates that routing should be via the originator of the RIP advertisement.

The purpose of the Next Hop field is to eliminate packets being routed through extra hops in the system. It is particularly useful when RIP is not being run on all of the routers on a network. A simple example is given in Appendix A. Note that Next Hop is an "advisory" field. That is, if the provided information is ignored, a possibly sub-optimal, but absolutely valid, route may be taken. If the received Next Hop is not directly reachable, it should be treated as 0.0.0.0.

      -----   -----   -----           -----   -----   -----
      |IR1|   |IR2|   |IR3|           |XR1|   |XR2|   |XR3|
      --+--   --+--   --+--           --+--   --+--   --+--
        |       |       |               |       |       |
      --+-------+-------+---------------+-------+-------+--
        <-------------RIP-2------------->

Assume that IR1, IR2, and IR3 are all "internal" routers which are under one administration (e.g. a campus) which has elected to use RIP-2 as its IGP. XR1, XR2, and XR3, on the other hand, are under separate administration (e.g. a regional network, of which the campus is a member) and are using some other routing protocol (e.g. OSPF). XR1, XR2, and XR3 exchange routing information among themselves such that they know that the best routes to networks N1 and N2 are via XR1, to N3, N4, and N5 are via XR2, and to N6 and N7 are via XR3. By setting the Next Hop field correctly (to XR2 for N3/N4/N5, to XR3 for N6/N7), only XR1 need exchange RIP-2 routes with IR1/IR2/IR3 for routing to occur without additional hops through XR1. Without the Next Hop (for example, if RIP-1 were used) it would be necessary for XR2 and XR3 to also participate in the RIP-2 protocol to eliminate extra hops.

RIPv2 - Authentication

Message layout with authentication:

Command (8 bits) | Version (8 bits) | Unused, set to all zeros (16 bits)
0xFFFF (16 bits)                    | Authentication Type (16 bits)
Password (bytes 0-3)
Password (bytes 4-7)
Password (bytes 8-11)
Password (bytes 12-15)

- RIPv2 authenticates the source of the packets.
- The source of the update takes over the first route entry of the message, which would normally carry the address family, IP address, subnet mask, next hop and metric, and uses it for authentication. This leaves room for only 24 route entries per packet instead of 25 with RIPv1.
- A password is indicated if the AFI field is set to 0xFFFF. The authentication type for simple (plain-text) authentication is 2. The password is left-justified and unused bytes are set to zero.
- MD5 authentication may be enabled to avoid sending the password in plain text. The Authentication Type field identifies the method used. MD5 computes a 128-bit hash value from the message plus the password. This hash is transmitted along with the message; the hash is recalculated at the far end, and the received and calculated hash values are compared. If they match, the message is authenticated.

RIP v2 Packet Format

Bit positions 0 to 31:

Command (8 bits) | Version (8 bits) | Reserved, must be zero (16 bits)
Address Family Identifier (16 bits) | Route Tag (16 bits)
IP Address
Subnet Mask
Next Hop
Metric
... (the last five rows repeat for each route entry)

RIP Limitations

- Maximum network diameter = 15
- Lack of alternative routes. RIPv2 keeps only one route to a destination in routing tables. It has to wait for updates after a failure to assess whether a new route (if any) exists.
- Regular updates include the entire routing table approximately every 30 seconds
- Poison reverse increases the size of the routing updates
- Count to infinity slows route loop prevention
- Metrics only involve hop count
- Broadcasts between neighbors (RIPv1 only)
- Classful routing means no prefix length is carried in route updates (RIPv1 only), and no VLSM
- No authentication mechanism (RIPv1 only)
- Slow convergence

OSPF

OSPF Concept: Having the Same Copy of Network Topology at Every Node
[Figure: every router holds the same set of LSAs (R1 LSA through R6 LSA, plus xyz, abc and pdq LSAs).]

SPF Algorithm

The Shortest Path First (SPF) routing algorithm is the basis for OSPF operations. When an SPF router is powered up, it initializes its routing-protocol data structures and then waits for indications from lower-layer protocols that its interfaces are functional.

After a router is assured that its interfaces are functioning, it uses the OSPF Hello protocol to acquire neighbors, which are routers with interfaces to a common network. The router sends hello packets to its neighbors and receives their hello packets. In addition to helping acquire neighbors, hello packets also act as keepalives to let routers know that other routers are still functional.

On multi-access networks (networks supporting more than two routers), the Hello protocol elects a designated router and a backup designated router. Among other things, the designated router is responsible for generating LSAs for the entire multi-access network. Designated routers allow a reduction in network traffic and in the size of the topological database.

When the link-state databases of two neighboring routers are synchronized, the routers are said to be adjacent. On multi-access networks, the designated router determines which routers should become adjacent. Topological databases are synchronized between pairs of adjacent routers.
Adjacencies control the distribution of routing-protocol packets, which are sent and received only on adjacencies.

Each router periodically sends an LSA to provide information on a router's adjacencies or to inform others when a router's state changes. By comparing established adjacencies to link states, failed routers can be detected quickly, and the network's topology can be altered appropriately. From the topological database generated from LSAs, each router calculates a shortest-path tree, with itself as root. The shortest-path tree, in turn, yields a routing table.

How OSPF Protocol Works

- Stage 1: Discovering Neighbors => Hello Message
- Stage 2: Electing the Designated Router => Hello Message
- Stage 3: Establishing Adjacencies => DB Description Msgs
- Stage 4: Propagating Link State Information => Flooding (using LS Request/Update Msgs)
- Stage 5: Calculating the Routing Table(s) => Dijkstra's Algorithm

First Requirements for the New IGP to Replace RIP

- Had to be more efficient than RIP
  - Faster convergence than RIP
  - Consume fewer network resources: link bandwidth and CPU cycles
  - Communicate changes quickly: link, interface, and router failures
- More descriptive metric than RIP
  - Hop-count limitations
  - Ability to include other factors (bandwidth, delay, reliability, etc.)
  - OSPF ended up using a 16-bit cost, with no limit on total path cost
  - Eliminated network diameter limitations

New IGP Requirements (2)

- Support for load balancing over multiple equal-cost links to a destination
  - More efficient use of network resources
  - Implementation was not mandated by the protocol
  - Multiple strategies exist: flow-based, round-robin, hash function, packet-by-packet
  - In theory, multiple vendor strategies can be combined in a single network and it can still work, but this needs to be examined closely in some cases
- Support for a routing hierarchy
  - Split the AS up into mini-ASs, in a sense
  - A scalability mechanism

New IGP Requirements (3)

- Separate internal and external routes
  - RIPv1 had no way to distinguish them
  - You generally trust information from your AS over routes from outside your AS
- Support for more flexible subnetting
  - Essentially CIDR addressing and notation, no notion of classful routing
- Security
  - Ability to control which routers participate in OSPF routing, based on a password
- ToS-based routing
  - Allows specification of different metrics for each of the original ToS categories
  - In reality, never really used; a chicken-and-egg problem

What is OSPF?

- An IGP using the link-state technique to update routing tables
- Based on the shortest path first (SPF) algorithm, also known as the Dijkstra algorithm
- Created to fill the need for a high-functionality, standards-based IGP for the TCP/IP protocol family
- Main RFCs:
  - 1587: OSPF NSSA Option
  - 2328: OSPF Version 2 (current implementation)

What is a Link-State Protocol?

- Link = router interface
- State = description of the interface and its relationship to neighboring routers
- OSPF routers send link-state advertisements (LSAs) to all other routers within the same hierarchical area
- Routers store the information in a link-state, or topological, database
- Each OSPF router uses the SPF algorithm to calculate the shortest path to each node

Three (3) Types of OSPF LS Messages

1. LSA (Link State Advertisement): LSAs are included in the database description packets (DDPs or DBDs). LSA entries include the link-state type, the address of the advertising router, the cost of the link, and the sequence number.
2. LSR (Link State Request): when a slave router receives a DDP (Database Description Packet), it sends an LSAck packet. Then it compares the received information with the information it has. If the DDP has more recent information, the slave router sends a link-state request (LSR) to the master router.
3. LSU (Link State Update): an LSU packet is sent in response to an LSR (Link State Request) packet sent from a slave router to a master router. The LSU contains complete information about the requested entry.

What is SPF?

- Places each router at the root of a tree and calculates the shortest path to each destination based on the cumulative cost to reach that destination
- Each router has its own view of the topology, even though all the routers build a shortest-path tree using the same link-state database

SPF Cost

- The cost, or metric, of an interface indicates the overhead required to send packets across that interface
- Cost = 10**8 / bandwidth (bps)
- Higher bandwidth = lower cost
  - 10M Ethernet line cost = 10**8 / 10**7 = 10
  - T1 line cost = 10**8 / 1544000 = 64
- To handle high-speed links, use a value greater than 10**8 in the cost calculation; this value is the Reference Bandwidth

Shortest Path Tree

Router A's SPF tree:
- A is the root; use the least-cost path to each IP prefix
- If a link goes down, the SPF tree is recalculated
- Each router calculates its own SPF tree

[Figure: Router A's shortest-path tree, with link costs, toward networks 128.213.0.0, 192.213.11.0 and 222.211.10.0.]

Dijkstra's Link State Algorithm

Principle: Dijkstra's algorithm works on the principle that the shortest possible path from the source has to extend one of the shortest paths already discovered.
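The SPF Cost formula and the Dijkstra principle above can be tried together in a short Python sketch. The function names, the three-router topology, and the choice to floor the cost at 1 are illustrative assumptions rather than anything taken from the slides; a real OSPF implementation runs SPF over its link-state database, not over a hand-written dictionary.

```python
import heapq


def ospf_cost(bandwidth_bps, reference_bps=10**8):
    """Interface cost = reference bandwidth / link bandwidth (integer, at least 1)."""
    return max(1, reference_bps // bandwidth_bps)


def shortest_path_tree(graph, root):
    """Dijkstra's algorithm.

    graph: dict mapping node -> {neighbor: cost}, costs non-negative.
    Returns (dist, parent): cumulative cost and predecessor for each reachable node.
    """
    dist = {root: 0}
    parent = {root: None}
    done = set()
    heap = [(0, root)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue  # stale heap entry; a shorter path was already finalized
        done.add(node)
        for neighbor, cost in graph.get(node, {}).items():
            candidate = d + cost
            if neighbor not in dist or candidate < dist[neighbor]:
                dist[neighbor] = candidate
                parent[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    return dist, parent


def path_to(parent, dest):
    """Follow the predecessor links backwards from dest to the root."""
    path = []
    while dest is not None:
        path.append(dest)
        dest = parent[dest]
    return path[::-1]


# Hypothetical topology: A-B and B-C are 10 Mbps Ethernet, A-C is a T1.
eth, t1 = ospf_cost(10_000_000), ospf_cost(1_544_000)
graph = {
    "A": {"B": eth, "C": t1},
    "B": {"A": eth, "C": eth},
    "C": {"A": t1, "B": eth},
}
dist, parent = shortest_path_tree(graph, "A")
```

With these costs (10 for Ethernet, 64 for the T1), Router A reaches C through B at a total cost of 20 instead of using the direct T1 link.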
Layman's Terms: Using a street map, you mark over the streets (tracing each street with a marker) in a certain order, until you have a route marked in from the starting point to the destination. The order is conceptually simple: from all the street intersections of the already marked routes, find the closest unmarked intersection, that is, closest to the starting point (the "greedy" part). The distance is that of the whole marked route to the intersection, plus the street to the new, unmarked intersection. Mark that street to that intersection, draw an arrow with the direction, then repeat. Never mark to any intersection twice. When you get to the destination, follow the arrows backwards. There will be only one path back against the arrows, the shortest one.

Demo:
http://www.eng.tau.ac.il/~shtilman/C-Programming/Year2/dijkstra.html
http://www.oopweb.com/Algorithms/Documents/PLDS210/Volume/dij-op.html

OSPF Breaks an AS into Areas
[Figure: AS 100 divided into Area 0, Area 10 and Area 234, joined by ABRs.]

Area Sizing Guidelines

Rules of thumb for a non-backbone area:
- No more than 100 routers
- No more than 50 neighbors per router
- Decrease when media are unstable
  - Consider static/default and demand techniques
- Decrease when large numbers of externals are injected
  - Consider whether the incoming externals can be summarized or filtered

When Might Single-Area OSPF Make Sense?

- Fewer than 50 routers with alternate paths
- Needs:
  - multivendor compatibility
  - fast convergence
  - VLSM
  - complex defaults and externals
- No clear candidates for the core
  - OSPF's power is greatest with hierarchy
  - Multiple domains may be better than one area

Design Guidelines - Network Topology (Cont'd)
[Figure: OSPF network size recommendation]

Design Guidelines - Network Topology (Cont'd)
[Figure: how many areas should be connected per ABR?]

OSPF: Location of Different Routers
[Figure: location of the different router types]

Different Types of OSPF Routers

- Internal router: an internal router has all of its interfaces in the same area. All internal routers in an area have the same link-state database.
- Backbone router: backbone routers sit on the perimeter of Area 0, with at least one interface connected to the backbone (Area 0).
- Area Border Router (ABR): ABRs are routers that have interfaces attached to multiple areas. These routers maintain a separate link-state database for each area to which they are connected. They are capable of routing traffic destined for or arriving from other areas.
- Autonomous System Boundary Router (ASBR): these are the routers that have at least one interface to an external network (another autonomous system). This autonomous system can be non-OSPF. ASBRs are capable of route redistribution, meaning that the router can import routing information from non-OSPF networks and distribute it in the OSPF network for which it is responsible, and vice versa.

OSPF Terminology

- Area ID: a 32-bit number identifying an area. Acquired from IANA.
- Router ID: a 32-bit number identifying a router. Normally the lowest numbered IP address belonging to a router.
- Router Priority: an 8-bit number that indicates this router's willingness to be a designated/backup designated router. A router priority of zero indicates that this router is ineligible to be a designated router.
- Link State Advertisement: exchanged by adjacent routers to allow area topology databases to be maintained and inter-area and intra-AS routes to be advertised. There are five types of link state advertisements.

OSPF NETWORKS

OSPF supports three kinds of connections and networks:
- Point-to-point, between exactly two routers
- Multi-access networks with broadcasting (e.g., Ethernet, Token Ring, etc.)
- Multi-access networks without broadcasting (e.g., packet-switching WANs such as an X.25 network)

PROTOCOL ENCAPSULATION and OSPF PROTOCOL NUMBER

NOTE: OSPF uses direct IP encapsulation. Protocol 89 is used for OSPF. OSPF is sent as multicast (224.0.0.5) on point-to-point and broadcast networks.

NETWORK LAYER (IP header):
- Protocol Type: 89
- Source IP Address: 128.66.12.2
- Destination IP Address: 224.0.0.5

DATA LINK LAYER (Ethernet frame):
Preamble | Destination Addr 00 00 1B 12 23 34 | Source Addr 00 00 1B 09 08 07 | Type Field | IP Header | OSPF | FCS

OSPF MESSAGE TYPES

- HELLO (Type 1) is used to identify neighbors, to elect a Designated Router for a multi-access network, to find out about an existing Designated Router, and as an "I'm alive" signal.
- DATABASE DESCRIPTION (Type 2) is used to exchange information during initialization so that a router can find out what data is missing from its topology database. Each LSA is preceded by a common LS Advertisement Header.
  - LSA Specific Type 1: Router Link Advertisement
  - LSA Specific Type 2: Network Link Advertisement
  - LSA Specific Type 3: Summary Link Advertisement to other areas
  - LSA Specific Type 4: Summary Link Advertisement to ASBR
  - LSA Specific Type 5: AS External Link Advertisement
- LINK STATE REQUEST (Type 3) is used to ask for data that a router has discovered is missing from its topology database, or to replace data that is out of date. Database descriptions are exchanged first; then Link State Requests are submitted to resolve missing or suspicious data.
- LINK STATE UPDATE (Type 4) is used to reply to a Link State Request and also to dynamically report changes in network topology.
- LINK STATE ACKNOWLEDGEMENT (Type 5) is used to confirm receipt of a Link State Update. The sender will retransmit until an update is ACKed.

Link State Advertisement Types

- Router Link Advertisement (LSA Type 1): generated by all OSPF routers; describes the state of the router's interfaces (links) within the area. Flooded throughout a single area only.
- Network Link Advertisement (LSA Type 2): generated by the Designated Router (DR) on a multi-access network; lists the routers connected to the network. Flooded throughout a single area only.
- Summary Link Advertisements: generated by Area Border Routers (ABRs) and flooded throughout a single area only. There are two types:
  - A summary advertisement (LSA Type 3) describing routes to destinations in other areas within the same AS.
  - A summary advertisement (LSA Type 4) describing routes to AS Boundary Routers, so that routers can get information out of the AS.
- AS External Link Advertisement (LSA Type 5): generated by the AS Boundary Routers (ASBRs) to describe routes to destinations external to the OSPF network. Flooded to all areas in the OSPF network.

LSA Types Used in Flooding

- Router Links (Type 1): describe the state and cost of the router's links (interfaces) to the area (intra-area).
- Network Links (Type 2): originated by a Designated Router (discussed later on) for multi-access segments with more than one attached router. Describe all routers attached to the specific segment.
- Summary Links (Types 3 and 4): originated by ABRs only. Describe networks in the AS but outside of the area (inter-area). Also describe the location of the ASBR.
- External Links (Type 5): originated by an ASBR. Describe destinations external to the autonomous system, or a default route to outside the AS.

LSA Specific - Description

- LSA Specific Type 1 (Router Description): these are the router-LSAs. They describe the collected states of the router's interfaces. For more information, consult Section 12.4.1 of RFC 2328.
- LSA Specific Type 2 (Network Description): these are the network-LSAs. They describe the set of routers attached to the network.
- LSA Specific Type 3 or 4: these are the summary-LSAs. They describe inter-area routes and enable the condensation of routing information at area borders. They are originated by area border routers. Type 3 summary-LSAs describe routes to networks (Network Description); Type 4 summary-LSAs describe routes to AS boundary routers (Router Description).
- LSA Specific Type 5 (Network Description): these are the AS-external-LSAs. Originated by AS boundary routers, they describe routes to destinations external to the Autonomous System. A default route for the Autonomous System can also be described by an AS-external-LSA.

BASIC OSPF OPERATIONAL SEQUENCE

S1: Routers discover their OSPF neighbors. When the OSPF routers first start, they establish and maintain a relationship with their neighbors using the Hello protocol.

S2: Routers elect a Designated Router (DR) and a Backup Designated Router (BDR) for a network (LAN) with multiple routers, using the Hello protocol.

S3: The routers form adjacencies. On multi-access networks, all routers become adjacent to the DR and the BDR.

BASIC OSPF OPERATIONAL SEQUENCE (Cont'd)

S4: Adjacent routers then exchange Database Description packets, which may describe part or all of the router's Link State Database. The adjacent routers then synchronize their Link State Databases by requesting missing or outdated information on the advertised links. This is done through a Link State Request packet. The response is a Link State Update packet. A Link State Acknowledgement packet is used to confirm the correct receipt of a Link State Update packet.

S5: The routers then calculate the routing table by running the Shortest Path First (SPF) algorithm, using the Link State Database as input. The routers periodically advertise their link states, based on a refresh timer expiration or a link state change. They then recalculate their routing tables.
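Step S2 above can be illustrated with a deliberately simplified Python sketch of choosing a DR and BDR from the routers on one LAN. It assumes the ordering rule from RFC 2328 (highest Router Priority wins, ties broken by highest Router ID) and the rule that priority 0 makes a router ineligible. It ignores details of the real election, such as electing the BDR first and never preempting an existing DR. The function name and sample routers are hypothetical.

```python
from ipaddress import IPv4Address


def elect_dr_bdr(routers):
    """Pick (dr, bdr) from a list of (router_id, priority) pairs.

    Simplified: rank eligible routers by priority, then by Router ID,
    both descending. Returns Router IDs, or None where no router qualifies.
    """
    # Priority 0 means the router is ineligible to become DR or BDR.
    eligible = [(rid, prio) for rid, prio in routers if prio > 0]
    ranked = sorted(
        eligible,
        key=lambda r: (r[1], int(IPv4Address(r[0]))),  # compare IDs numerically
        reverse=True,
    )
    dr = ranked[0][0] if ranked else None
    bdr = ranked[1][0] if len(ranked) > 1 else None
    return dr, bdr


# Hypothetical LAN with four routers: (Router ID, Router Priority).
lan = [("10.0.0.1", 1), ("10.0.0.2", 1), ("10.0.0.3", 0), ("10.0.0.9", 5)]
dr, bdr = elect_dr_bdr(lan)
```

Here 10.0.0.9 becomes DR because of its higher priority, and 10.0.0.2 becomes BDR because it has the higher Router ID of the two priority-1 routers.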
Discovering Neighbors - Hello Protocol
[Figure: discovering neighbors with the Hello protocol]

Hello Exchange Process - Pt-to-Pt Link
[Figure: Hello exchange on a point-to-point link]

Hello Exchange Process - Ethernet Link
[Figure: Hello exchange on an Ethernet link]

OSPF HELLO MESSAGE

Common message header (bit positions 0 to 31):
- Version, Message Type, Message Length
- Router Identification
- Area Identification
- Checksum, Authentication Type
- Authentication (octets 0-3)
- Authentication (octets 4-7)

Hello message body:
- Network Mask
- Hello Interval, Options (E and T bits), Router Priority
- Dead Interval Timer
- Designated Router
- Backup Designated Router
- Neighbor One IP Address
- ...

The Hello message type:
- Identifies neighbors
- Elects the Designated Router (DR)
- Finds out about an existing DR
- Acts as an alive signal

NETWORK MASK: This field contains the subnet mask of the network over which the message was sent (the mask associated with the interface). If this field does not match the receiving router's mask for that network, the receiving router rejects the Hello message and does not accept the transmitting router as a neighbor. In the absence of subnetting, it is set to the default subnet mask.

HELLO INTERVAL: This field tells how often, in seconds, this router transmits its Hello messages. On a broadcast network it is normally 10 seconds; on a non-broadcast network it is normally 30 seconds.

The HelloInterval and RouterDeadInterval fields in a received OSPF packet must match the settings configured on the receiving interface.
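The matching rules for Hello messages described above can be sketched in a few lines of Python. The class and function names are hypothetical, and the 40-second dead interval is only an illustrative value; a real implementation also checks the area, authentication, and option bits in the packet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HelloParams:
    """The three Hello fields that must agree (illustrative subset)."""
    network_mask: str
    hello_interval: int  # seconds
    dead_interval: int   # seconds


def accept_neighbor(local, received):
    """Return (accepted, mismatched_field_names) for a received Hello."""
    fields = ("network_mask", "hello_interval", "dead_interval")
    mismatches = [f for f in fields if getattr(local, f) != getattr(received, f)]
    return (not mismatches, mismatches)


# Interface settings on the receiving router (assumed values).
local = HelloParams("255.255.255.0", hello_interval=10, dead_interval=40)

ok = accept_neighbor(local, HelloParams("255.255.255.0", 10, 40))
bad = accept_neighbor(local, HelloParams("255.255.255.0", 30, 40))
```

The first Hello is accepted; the second is rejected and reports `hello_interval` as the mismatched field, which is the kind of detail that helps when two routers refuse to become neighbors.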
86 DR and BDR Election - Example

89 Database Sync Process
In a link-state routing algorithm, it is very important for all routers' link-state databases to stay synchronized so that the routers compute consistent routing tables. OSPF simplifies this by requiring only adjacent routers to remain synchronized. The synchronization process begins as soon as the routers attempt to bring up the adjacency.
Each router describes its database by sending a sequence of Database Description packets to its neighbor. Each Database Description packet describes a set of LSAs belonging to the router's database. This sending and receiving of Database Description packets is called the "Database Exchange Process". During this process, the two routers form a master/slave relationship. Each Database Description packet has a sequence number. Database Description packets (DDPs) sent by the master (polls) are acknowledged by the slave through echoing of the sequence number. The exchange initially carries only the LSA headers, not the full LSA contents, which saves bandwidth and processing.
The master is the only one allowed to retransmit Database Description packets. It does so only at fixed intervals, the length of which is the configured per-interface constant RxmtInterval.
Each Database Description packet contains an indication that there are more packets to follow: the M-bit. The Database Exchange Process is over when a router has received and sent Database Description packets with the M-bit off.
During and after the Database Exchange Process, each router has a list of those LSAs for which the neighbor has more up-to-date instances. These LSAs are requested in Link State Request packets. Link State Request packets that are not satisfied are retransmitted at fixed intervals of RxmtInterval. When the Database Exchange Process has completed and all Link State Requests have been satisfied, the databases are deemed synchronized and the routers are marked fully adjacent. At this time the adjacency is fully functional and is advertised in the two routers' router-LSAs.
Criteria for determining adjacency between two routers:
They have the same number of LSAs in their LSDBs.
The sums of their LSAs' LS Checksum fields are equal.

90 OSPF DB DESCRIPTION MESSAGE
DB Description message type: establishing adjacency.
Common message header (bit ruler 0, 8, 16, 24, 31): Version, Message Type, Message Length, Router Identification, Area Identification, Checksum, Authentication Type, Authentication (octets 0-3), Authentication (octets 4-7).
Link advertisement header: LS Age, Options, LS Type, Link State Identification, Advertising Router, Link State Sequence Number, LS Checksum, Length.
91 OSPF DB DESCRIPTION MESSAGE WITH LINK ADVERTISEMENT HEADER
Headers contain enough information to identify the LS records needed during synchronization. The receiver marks the LS records to be requested.
Common message header: Version, Message Type, Message Length, Router Identification, Area Identification, Checksum, Authentication Type, Authentication (octets 0-3), Authentication (octets 4-7).
Link advertisement header (bit ruler 0, 8, 16, 24, 31): LS Age, Options, LS Type, Link State Identification, Advertising Router, Link State Sequence Number, LS Checksum, Length.
LS Types are: Type 1: Router Links; Type 2: Network Links; Type 3/4: Summary Links; Type 5: External Links.
LS Age: A 16-bit number giving the time in seconds since the advertisement was originated. The age increases while the advertisement resides in a router's database and with each hop over which it is flooded. When it reaches the maximum value, normally one hour, the advertisement is discarded unless it is needed for synchronization.
Options: See the Hello Packet.
LS Type: This field specifies which of the five link state advertisement types this header describes.
Link State ID: A unique ID for the advertisement, which depends on the LS type. LSA Types 1 and 4 use a Router ID. LSA Type 2 uses the IP address of the Designated Router. LSA Types 3 and 5 use an IP network number.

92 OSPF DB DESCRIPTION MESSAGE WITH LINK ADVERTISEMENT HEADER (CONT'D)
Advertising Router: The Router ID of the router that originated the link state advertisement. For LSA Type 1 it is identical to the Link State ID. For LSA Type 2 it is the Router ID of the network's Designated Router. For LSA Types 3/4 it is the Router ID of the Area Border Router. For LSA Type 5 it is the Router ID of the AS Boundary Router.
LS Sequence Number: This field is used to sequence the advertisements and to detect duplicate or old packets.
LS Checksum: The checksum of the complete link state advertisement, excluding the LS Age field.
Length: The size of the advertisement in bytes, including the advertisement header.
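The link advertisement header fields described above can be unpacked with a fixed-size binary layout. The sketch below assumes the standard 20-byte OSPFv2 LSA header widths (16-bit LS Age, 8-bit Options and LS Type, three 32-bit fields, 16-bit checksum and length); the function name and the sample values are hypothetical.

```python
import ipaddress
import struct

# LS Age (16) | Options (8) | LS Type (8) | Link State ID (32) |
# Advertising Router (32) | LS Sequence Number (32, signed) |
# LS Checksum (16) | Length (16), all in network byte order.
LSA_HEADER = struct.Struct("!HBB4s4siHH")


def parse_lsa_header(data):
    """Decode the first 20 bytes of an LSA into a dictionary."""
    (age, options, ls_type, ls_id, adv_router,
     seq, checksum, length) = LSA_HEADER.unpack(data[:LSA_HEADER.size])
    return {
        "ls_age": age,
        "options": options,
        "ls_type": ls_type,
        "link_state_id": str(ipaddress.IPv4Address(ls_id)),
        "advertising_router": str(ipaddress.IPv4Address(adv_router)),
        "ls_sequence_number": seq,
        "ls_checksum": checksum,
        "length": length,
    }


# Hypothetical Type 1 (router links) header: Link State ID equals the
# advertising router's ID, as the field description states.
sample = LSA_HEADER.pack(5, 0x02, 1, bytes([10, 0, 0, 1]),
                         bytes([10, 0, 0, 1]), -0x7FFFFFFF, 0xBEEF, 36)
header = parse_lsa_header(sample)
```

The sequence number is decoded as a signed value so that two instances of the same advertisement can be compared directly to find the newer one.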
93 LSA Flooding - Operations

94 Example of Router LSAs

95 Example of Network LSAs

96 Example of Summary LSAs

97 External Route LSAs - Example

99 Propagating LS Info: When a link state changes

100 Propagating LS Info: When a link state changes

103 Pros and Cons of OSPF
Advantages of OSPF:
1. Changes in an OSPF network are propagated quickly.
2. OSPF is hierarchical, using area 0 as the top of the hierarchy.
3. OSPF is a link-state algorithm.
4. OSPF supports Variable Length Subnet Masks (VLSM).
5. OSPF uses multicasting within areas.
6. After initialization, OSPF sends updates only on the routing table sections that have changed; it does not send the entire routing table.
7. Using areas, OSPF networks can be logically segmented to decrease the size of routing tables. Table size can be further reduced by using route summarization.
8. OSPF is an open standard, not tied to any particular vendor.
9. OSPF can load-balance across up to 6 equal-cost routes, with 4 as the default.
Disadvantages of OSPF:
1. OSPF is very processor intensive.
2. OSPF maintains multiple copies of routing information, increasing the amount of memory needed.
3. Using areas, OSPF can be logically segmented (this can be a good thing and a bad thing).
4. OSPF is not as easy to learn as some other protocols.
5. In the case where an entire network is running OSPF and one link within it is "bouncing" every few seconds, OSPF updates would dominate the network by informing every other router every time the link changed state.

105 BGP - Autonomous System
Networks and routers under a single administrative authority
Each AS is assigned a number
AS numbers range from 1 to 65,535

106 Different AS Types
http://ipmon.sprint.com/pubs_trs/tutorials/Taft_BGP.pdf (slide 30)

107 BGP is
An Exterior Gateway Protocol (EGP), used to propagate tens or hundreds of thousands of routes between networks (ASs).
The only protocol used to do this on the Internet today.

108 What is BGP?
BGP is an inter-domain routing protocol that communicates prefix reachability
BGP is a path vector protocol
Similar to distance vector
BGP views the Internet as a collection of autonomous systems
Stability is very important to the Internet and BGP
BGP supports CIDR
BGP routers exchange routing information between peers
Defined in RFC 1771 (since obsoleted by RFC 4271)

109 How Does BGP Work?
BGP uses TCP as its transport protocol (port 179). Two BGP routers form a TCP connection between one another (peer routers) and exchange messages to open and confirm the connection parameters.
BGP routers exchange network reachability information. This information is mainly an indication of the full paths (BGP AS numbers) that a route should take in order to reach the destination network. It helps in constructing a loop-free graph of ASs, on which routing policies can be applied to enforce restrictions on routing behavior.
Any two routers that have formed a TCP connection in order to exchange BGP routing information are called peers, or neighbors. BGP peers initially exchange their full BGP routing tables. After this exchange, incremental updates are sent as the routing table changes. BGP keeps a version number of the BGP table, which should be the same for all of its BGP peers.
The version number changes whenever BGP updates the table due to routing information changes. Keepalive packets are sent to ensure that the connection is alive between the BGP peers and notification packets are sent in response to errors or special conditions. 110 BGP Fundamentals BGP peers exchange routes and send updates not faster than every 90 seconds by default. Routes consist of destination prefixes with an AS path and BGP-specific attributes Each BGP update contains one path advertisement and attributes Many destinations can share the same path BGP compares the AS path and attributes to choose the best path Unfeasible routes can be advertised Unreachable routes are withdrawn 111 BGP Connections BGP updates are incremental No regular refreshes Except at session establishment, when volume of routing can be high BGP runs over TCP connections TCP port 179 TCP Services Fragmentation, Acknowledgements, Checksums, Sequencing, and Flow Control No automatic neighbor discovery 112 BGP Peering BGP sessions are established between peers Two types of peering sessions BGP Speakers E-BGP (external) peers with different ASs I-BGP (internal) peers within the same AS Still requires interior gateway protocols (IGPs) IGP connects BGP speakers within the AS IGP advertises internal routes 113 iBGP AS 3847 When BGP speakers in the same AS form a BGP connection for the purpose of exchanging routing information, they are said to be running IBGP or internal BGP. A c B IBGP speakers are usually fully-meshed. 114 eBGP (1) When BGP speakers in different ASs form a BGP connection for the purpose of exchanging routing information, they are said to be running EBGP or external BGP. EBGP peers are usually directly connected. AS 3561 A AS 3847 B 115 eBGP (2) AS 2033 AS 7007 AS 4200 AS 2041 116 iBGP and eBGP Diagram AS 1239 AS 7007 XP AS 701 AS 6079 AS 4006 117 eBGP Rules By default, only talks to directly-connected router. Sends the one best BGP route for each destination. 
Sends all of the important “attributes”; omits the “local preference” attribute. Adds (prepends) the speaker’s ASN to the “as-path” attribute. Usually rewrites the “next-hop” attribute. 118 iBGP Rules Can talk to routers many hops away by default. Can only send routes it “injects”, or routes heard DIRECTLY from an external peer. Thus, requires a FULL mesh. Sends all attributes. Leaves the as-path attribute alone. Doesn’t touch the “next hop” attribute. 119 Logical view of 16 routers, fully meshed 120 iBGP Restriction (1) Assume AS1239 sends route 10.0.0.0/8 to AS2828. Router A will send that route to Routers B and C. B AS 2828 C A AS 1239 121 iBGP Restriction (2) When Router B receives 10.0.0.0/8, it will not propagate that route to Router C because it was learned from an iBGP neighbor. Router C will behave similarly. B AS 2828 C A AS 1239 122 BGP Route Advertisement Only advertise the active BGP routes to peers (by default) Never forward I-BGP routes to I-BGP peers BGP Next-hop must be reachable Prevents loops Withdraw routes if active BGP routes become unreachable 123 CIDR and Aggregate Addresses (1) AS 2 has the detailed routes AS 1 (3) AS 1 learns only the aggregate and not the details 192.168.0.0/24 192.168.1.0/24 192.168.2.0/24 192.168.3.0/24 Router A Router B 2.2.2.2 3.3.3.2 192.168.0/22 192.168.0/22 2.2.2.1 3.3.3.1 Router C AS 2 (2) BGP with Routing Policy can advertise a prefix that aggregates the detailed routes AS 3 124 IBGP, EBGP Example AS 1 EBGP AS 3 AS 2 EBGP IBGP 125 Advertising Networks Using the Network command Redistributing static routes Redistributing Dynamic routes 126 Advertising Networks Using Network Command Router A 11.0.0.0 12.0.0.0 router bgp 1 neighbor 1.1.1.2 remote-as 2 network 11.0.0.0 network 12.0.0.0 Router B router bgp 2 neighbor 1.1.1.1 remote-as 1 network 92.0.0.0 network 93.0.0.0 AS1 A EBGP 92.0.0.0 93.0.0.0 B AS2 127 Advertising Networks By redistributing Static Routes 11.0.0.0 12.0.0.0 A AS1 Router A router bgp 1 neighbor 
1.1.1.2 remote-as 2 redistribute static ip route 11.0.0.0 255.0.0.0 null 0 ip route 12.0.0.0 255.0.0.0 null 0 EBGP 92.0.0.0 93.0.0.0 B AS2 128 Advertising Networks By Redistributing Dynamic Routes 11.0.0.0 12.0.0.0 A AS1 Router A router bgp 1 neighbor 1.1.1.2 remote-as 2 redistribute ospf 1 EBGP 92.0.0.0 93.0.0.0 router ospf 1 network 11.0.0.0 0.255.255.255 area 0 B AS2 129 BGP Attributes AS-path Next-hop Local preference MED Origin Communities 130 BGP Attributes AS-Path traversed one or more members of a set {1880, 1881, 1882} (as-set) A list of AS’s that a route has traversed 1880 1883 (sequence) Shortest AS path preferred 1883 193.0.32/24 Path 1880 193.0.34/24 1881 193.0.33/24 1882 193.0.35/24 193.0.33/24 1880 1881 193.0.34/24 1880 193.0.35/24 1880 1882 193.0.32/22 1880 1983 131 BGP Attributes Multi-Exit Discriminator (MED) 690 1883 1755 200 1880 209 Preference sent to all routers in remote AS Where do I want to receive the traffic ? 132 Multi-Exit Discriminator (MED) Indication to external peers of the preferred path into an AS. Affects routes with same AS path. Advertised to external neighbors Usually based on IGP metric * Lowest MED preferred 133 MED Attribute (2) The MED (multi-exit discriminator) is a commonly used attribute. It comes after the AS_PATH in evaluation, and thus isn’t quite as much of a “hammer” as local-pref. Commonly, MED is used to tack a distance on BGP routes as they move within your network. NSPs advertise MEDs to each other to let it be known which POP the route is “closest” to. 134 BGP Attributes Local Preference 690 1755 1880 A Needs to go to 690 666 Preference sent to all routers in local AS Where do I want traffic to leave? 102 NW’98 135 © 1998, Cisco Systems, Inc. 
135 Local Preference Attribute
Local to the AS
Used to influence BGP path selection
Default 100
Highest local-pref preferred
[Figure: AS 3847 and AS 6201, routers A to G; 208.1.1.0/24 is shown with local-pref 80 and with local-pref 100, and the local-pref 100 path is preferred by all AS 3847 routers.]

136 Local-Pref Attribute (2)
An often-used attribute, local-pref (normally 100) overrides AS_PATH and is transitive throughout your network. It is never advertised to an eBGP peer. For example, you can express the policy “prefer private interconnects” by setting local-pref to 150 on the private interconnects and leaving all other peers at 100. Best used as an intermediate-level knob.

137 The BGP Path Decision Algorithm
BGP determines the best path to each destination for a BGP speaker by comparing path attributes according to the following selection sequence:
1. Select a path with a reachable next hop.
2. Select the path with the highest weight.
3. If path weights are the same, select the path with the highest local preference value.
4. Prefer locally originated routes (network routes, redistributed routes, or aggregated routes) over received routes.
5. Select the route with the shortest AS-path length.
6. If all paths have the same AS-path length, select the path based on origin: IGP is preferred over EGP; EGP is preferred over Incomplete.
7. If the origins are the same, select the path with the lowest MED value.
8. If the paths have the same MED values, select the path learned via EBGP over one learned via IBGP.
9. Select the route with the lowest IGP cost to the next hop.
10. Select the route received from the peer with the lowest BGP router ID.
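The ten-step selection sequence maps naturally onto a sort key, compared field by field. The sketch below is an illustration under simplifying assumptions: the `Path` class and its field names are hypothetical, router IDs are plain integers, and MED is compared between all paths rather than only between paths from the same neighboring AS.

```python
from dataclasses import dataclass

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass(frozen=True)
class Path:
    """Hypothetical record of the attributes used by the decision steps."""
    name: str
    next_hop_reachable: bool = True
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path: tuple = ()
    origin: str = "igp"
    med: int = 0
    ebgp: bool = True
    igp_cost: int = 0
    peer_router_id: int = 0


def preference_key(p):
    """Smaller tuples win; 'highest wins' attributes are negated."""
    return (
        -p.weight,                 # step 2: highest weight
        -p.local_pref,             # step 3: highest local preference
        not p.locally_originated,  # step 4: locally originated first
        len(p.as_path),            # step 5: shortest AS path
        ORIGIN_RANK[p.origin],     # step 6: IGP < EGP < Incomplete
        p.med,                     # step 7: lowest MED (simplified)
        not p.ebgp,                # step 8: EBGP over IBGP
        p.igp_cost,                # step 9: lowest IGP cost to next hop
        p.peer_router_id,          # step 10: lowest router ID
    )


def best_path(paths):
    """Step 1 filters unreachable next hops, then the key decides."""
    usable = [p for p in paths if p.next_hop_reachable]
    return min(usable, key=preference_key) if usable else None


candidates = [
    Path("short-as-path", as_path=(1880,)),
    Path("private-interconnect", as_path=(1883, 1880), local_pref=150),
    Path("unreachable", weight=10, next_hop_reachable=False),
]
chosen = best_path(candidates)
```

Here the path with local-pref 150 wins despite its longer AS path, because local preference is evaluated before AS-path length.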
138 Common Internet Routing Phenomenon
E-BGP Route Flapping/Oscillation
Remedy: Route Flap Dampening

139 BGP - Route Flapping
Routing instability
Routes disappear, appear again, then disappear
Visible to the Internet
Withdrawal, announcement, withdrawal, announcement
Wastes resources
Some causes of route flapping
Flaky inter-AS links
Flaky or insufficient hardware
Link congestion
IGP instability
Operator error

140 BGP – Route Flap Dampening
If you are running BGP version 4, the BGP process assigns a penalty of 1000 to the route each time it flaps. When the penalty value exceeds the first of two limits (Re-use limit, Suppress limit), the route is moved into the 'historical' list of routes, dampened, and then is no longer accepted from other peers or announced to any peers. After the first limit has been exceeded, the timer which tracks the period for which the route is to be dampened is doubled for each flap. The suppression half-life is 15 minutes. The maximum suppress limit is four times the half-life; thus, one hour is the default. The suppression penalty decays at half the half-life (7.5 minutes). So:
1. First flap: a penalty of 1000 is assigned; the route is placed in the 'historical' category and becomes less preferred.
2. Second flap: the route has met the suppression limit of 2000 (a Cisco default). The route is dampened and is no longer advertised to neighbors or accepted from neighbors.
3. If the route does not flap any further, the penalty decays. The decay process begins 7.5 minutes after the route stabilizes and proceeds exponentially every 5 seconds thereafter.
4. Once the suppression penalty decays below 750 (the default value for the reuse threshold), the route is removed from the dampened state and reused. The router parses the historical routes list every 10 seconds for reusable routes.
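The penalty arithmetic can be checked with a short calculation. The sketch below uses the figures quoted above (suppress limit 2000, reuse threshold 750, half-life 15 minutes) and assumes plain continuous exponential decay; it ignores the 7.5-minute delay and 5-second decay steps, and the function names are hypothetical.

```python
import math

HALF_LIFE_MIN = 15.0
REUSE_LIMIT = 750.0


def decayed_penalty(penalty, minutes, half_life=HALF_LIFE_MIN):
    """Penalty remaining after `minutes` without a further flap."""
    return penalty * 0.5 ** (minutes / half_life)


def minutes_until_reuse(penalty, reuse_limit=REUSE_LIMIT,
                        half_life=HALF_LIFE_MIN):
    """Time for the penalty to decay down to the reuse threshold."""
    if penalty <= reuse_limit:
        return 0.0
    return half_life * math.log2(penalty / reuse_limit)


# A route suppressed at a penalty of 2000 that then stays stable.
wait = minutes_until_reuse(2000.0)
```

Under these assumptions a route suppressed at 2000 becomes reusable after a little over 21 minutes, well inside the one-hour maximum.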
141 Route Flap Dampening - Operation

142 Useful Tool To Understand BGP Peering Relationships
www.netlantis.org

143 UDP, TCP/IP - Internet Protocols
UDP: User Datagram Protocol
UDP Header

146 What is UDP?
Relatively simple compared to TCP
UDP provides connectionless service for data delivery between two hosts
No logical connection is established by UDP, so no connection-oriented services are supplied
The application may need some of those services (e.g., no corrupt packets), so the application is responsible for providing them
Applications that use UDP: tftp, DNS (for some functions), SNMP, RIP, VoIP

147 Why Use UDP over TCP?
No connection establishment. TCP uses a three-way handshake before it starts to transfer data. UDP just blasts away without any formal preliminaries. Thus UDP does not introduce any delay to establish a connection. This is probably the principal reason why DNS runs over UDP rather than TCP: DNS would be much slower if it ran over TCP. HTTP uses TCP rather than UDP, since reliability is critical for Web pages with text. But the TCP connection establishment delay in HTTP is an important contributor to the "world wide wait".
No connection state. TCP maintains connection state in the end systems. This connection state includes receive and send buffers, congestion control parameters, and sequence and acknowledgment number parameters. This state information is needed to implement TCP's reliable data transfer service and to provide congestion control. UDP, on the other hand, does not maintain connection state and does not track any of these parameters. For this reason, a server devoted to a particular application can typically support many more active clients when the application runs over UDP rather than TCP.
Small segment header overhead.
The TCP segment has 20 bytes of header overhead in every segment, whereas UDP has only 8 bytes of overhead.
Unregulated send rate. TCP has a congestion control mechanism that throttles the sender when one or more links between sender and receiver become excessively congested. This throttling can have a severe impact on real-time applications, which can tolerate some packet loss but require a minimum send rate. On the other hand, the speed at which UDP sends data is constrained only by the rate at which the application generates data, the capabilities of the source (CPU, clock rate, etc.) and the access bandwidth to the Internet. We should keep in mind, however, that the receiving host does not necessarily receive all the data: when the network is congested, a significant fraction of the UDP-transmitted data could be lost due to router buffer overflow. Thus, the receive rate is limited by network congestion even if the sending rate is not constrained.

148 UDP Encapsulation

149 UDP Header

150 Computing the UDP Checksum

151 IP Fragmentation

152 Other UDP Uses
Path MTU Discovery
Using Traceroute
Using UDP
Max UDP datagram size
ICMP Source Quench

153 TCP: Transmission Control Protocol
TCP Header

155 What is TCP?
Relatively complex compared to UDP
TCP provides connection-oriented service for data delivery between two hosts
Client and server establish a logical TCP connection before exchanging data
TCP segments flow over the network in IP packets (which are connectionless) so that the logical TCP connection can be maintained over a changing physical path
Connections are full-duplex
Timers are used to maintain connections
TCP relies on IP to provide hop-by-hop routing and error detection
Applications that use TCP: telnet, ftp, http, many others

156 TCP Logical Connections/Ports
TCP and UDP introduce the concept of ports
Common ports and the services that run on them:
Service   Port(s)
FTP       21 and 20
telnet    23
SMTP      25
http      80
POP3      110
Multiple ports/logical connections can be supported

157 TCP Header Fields and Other Info
Each connection is uniquely identified by the combination of src IP, dest IP, src port, and dest port
socket = IP + port (e.g., 10.1.1.1.23)
Sequence numbers: The sequence number of the first data octet in this segment (except when SYN is present). If SYN is present, the sequence number is the initial sequence number (ISN) and the first data octet is ISN+1.
Acknowledgements: If the ACK control bit is set, this field contains the value of the next sequence number the sender of the segment is expecting to receive. Once a connection is established, this is always sent.
Header Length: Length of the header in 32-bit words
Flag bits: Provide connection-oriented service
The SYN and FIN flags are used when establishing and terminating a TCP connection, respectively. The ACK flag is set any time the Acknowledgement field is valid, implying that the receiver should pay attention to it. The URG flag signifies that this segment contains urgent data. When this flag is set, the UrgPtr field indicates where the non-urgent data contained in this segment begins.
The PUSH flag signifies that the sender invoked the push operation, which indicates to the receiving side of TCP that it should notify the receiving process of this fact. Finally, the RESET flag signifies that the receiver has become confused and so wants to abort the connection. Window size The number of data octets beginning with the one indicated in the acknowledgment field which the sender of this segment is willing to accept. 158 TCP Options 159 TCP Encapsulation 160 TCP Functionalities Connection-Orient Identifies traffic flow by some identifier rather than by explicitly listing source and destination addresses Stream Data Transfer From the application's viewpoint, TCP transfers a contiguous stream of bytes. TCP does this by grouping the bytes in TCP segments, which are passed to IP for transmission to the destination. TCP itself decides how to segment the data and it may forward the data at its own convenience. Reliability TCP assigns a sequence number to each byte transmitted, and expects a positive acknowledgment (ACK) from the receiving TCP. If the ACK is not received within a timeout interval, the data is retransmitted. The receiving TCP uses the sequence numbers to rearrange the segments when they arrive out of order, and to eliminate duplicate segments. Flow Control The receiving TCP, when sending an ACK back to the sender, also indicates to the sender the number of bytes it can receive beyond the last received TCP segment, without causing overrun and overflow in its internal buffers. This is sent in the ACK in the form of the highest sequence number it can receive without problems. Logical Connections The reliability and flow control mechanisms described above require that TCP initializes and maintains certain status information for each data stream. The combination of this status, including sockets, sequence numbers and window sizes, is called a logical connection. 
Each connection is uniquely identified by the pair of sockets used by the sending and receiving processes. Multiplexing To allow for many processes within a single host to use TCP communication facilities simultaneously, the TCP provides a set of addresses or ports within each host. Concatenated with the network and host addresses from the internet communication layer, this forms a socket. A pair of sockets uniquely identifies each connection. Full Duplex TCP provides for concurrent data streams in both directions. 161 Important Factors That Affect TCP (Application) Performance Link BW (window size), network delay (RTT) and MTU size (Bit Error Rate) How a receiver/sender implements Acknowlegment scheme Speed at which received data is processed and ACKed at destination (sender/receiver buffer size) Maximum TCP Buffer (Memory) space for use by any TCP connection Socket Buffer Sizes for individual TCP connection Ability to actively manage speed mismatch between sender and receiver by regulating how much can/should be sent without getting into a congestion state Window size based on Bandwidth Delay Product Ability to actively detect and prevent dynamic congestion and re-act to it Stop-N-Wait, Go-back-N, Selective Ack Connection timeout, slow start, back-off strategies Ability for sender to maximize/improve BW/network throughput TCP Large Window Scaling 162 TCP Connection-Oriented Protocol: TCP Connection Establishment and Termination Passive and Active Ports TCP enables two methods to establish a connection: active and passive. An active connection establishment happens when TCP issues a request for the connection, based on an instruction from an upperlevel protocol that provides the socket number. A passive approach takes place when the upper-level protocol instructs TCP to wait for the arrival of connection requests from a remote system (usually from an active open instruction). When TCP receives the request, it assigns a port number. 
This enables a connection to proceed rapidly, without waiting for the active process.

164 Connection Establishment

165 Connection Termination

166 MSS: Maximum Segment Size

169 TCP Half-Close

170 TCP State Diagram

171 States: Establishment and Termination

172 TCP Reset

173 Simultaneous Open
For example: An application at host A uses 7777 as the local port and connects to port 8888 on host B. At the same time, an application at host B uses 8888 as the local port and connects to port 7777 on host A. This is "Simultaneous Open".
Here is another example: The Telnet client at host A connects to the Telnet server at host B. At the same time, the Telnet client at host B connects to the Telnet server at host A. Be careful. This time, it is not "Simultaneous Open", because the Telnet servers on both sides do a "passive open" instead of an "active open". There are actually two TCP connections, instead of the single connection of a "Simultaneous Open".

174 Simultaneous Close

175 TCP Provides a Byte-Stream Service
TCP is a byte-oriented protocol, which means the sender writes bytes into a TCP connection and the receiver reads bytes out of the TCP connection. Although "byte-stream" describes the service TCP offers to application processes, TCP does not, itself, transmit individual bytes over the Internet. Instead, TCP on the source host buffers enough bytes from the sending process to fill a reasonably sized packet, and then sends this packet to its peer on the destination host. TCP on the destination host then empties the contents of the packet into a receive buffer, and the receiving process reads from this buffer at its leisure. The accompanying figure illustrates this situation and, for simplicity, shows data flowing in only one direction. In general, remember, a single TCP connection supports byte-streams flowing in both directions.
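The independence of writes and reads on a byte stream can be demonstrated locally. This is a minimal sketch that assumes a connected pair of local stream sockets from `socket.socketpair()` behaves like a TCP connection for this purpose (it is a stream socket, but not TCP itself); the helper name `recv_exact` is hypothetical.

```python
import socket


def recv_exact(sock, count):
    """Read exactly `count` bytes; recv() alone may return fewer."""
    chunks = []
    remaining = count
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("stream closed before enough data arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


writer, reader = socket.socketpair()

# Three writes of 10, 20 and 50 bytes: the stream keeps no record of
# where one write ended and the next began.
writes = [b"a" * 10, b"b" * 20, b"c" * 50]
for piece in writes:
    writer.sendall(piece)

# The same 80 bytes read back as four 20-byte reads.
reads = [recv_exact(reader, 20) for _ in range(4)]

writer.close()
reader.close()
```

The first read spans the boundary between the first two writes, which is why applications that need message boundaries must add their own framing.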
176 TCP Provides a Byte-Stream Service
It is a streaming protocol
No “record markers” are inserted into the data stream
Writes on one end and reads on the other are independent of each other
Ex: data could be written in a sequence of 10 bytes, then 20 bytes, then 50 bytes
That data could be read as 4 x 20 bytes
No interpretation of the application data
Very similar to the Unix kernel’s treatment of files in a filesystem

177 TCP Provides Reliability
TCP segments are sized for the application
Segments must be acknowledged
TCP checksum on header and data
Out-of-sequence IP packets can be reordered
Receiving TCP must discard duplicate packets
Flow control is employed to manage finite buffer space

178 TCP Flow Control Algorithms
Sliding Window
Regular Sliding Window
The BW Delay Product (BWDP)
Window Scaling option
The Silly Window syndrome
Tinygram Congestion Prevention: The Nagle algorithm
TCP Timeout
RTT-Timeout Calculation
Acknowledgement schemes
Stop-N-Wait, Go-back-N, Selective Ack schemes
Ambiguous Acknowledgements: Karn’s Algorithm
Congestion Avoidance
Slow Start

179 End-to-end flow control
Problem: The sender can send more traffic than the receiver can handle (too fast).
Solution: A variable sliding window protocol. Each acknowledgement, which specifies how many octets have been received, contains a window advertisement that specifies how many additional octets the receiver is prepared to accept.

180 Variable Window Size
Transmitter window size   Value of window advertisement   Receiver free space in buffer
increase                  bigger                          increase
decrease                  smaller                         decrease
stop transmissions        0                               full

181 Sliding window protocol in TCP
TCP allows the window size to vary over time.
Window size changes at the time it slides forward.
Advantage: it provides flow control as well as reliable transfer.
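The window advertisement mechanism can be sketched from the receiver's side. This is an illustrative model only: the class name, method names, and the 4000-byte buffer are hypothetical, and real TCP tracks sequence numbers rather than byte counts.

```python
class ReceiveBuffer:
    """Toy receiver that advertises its free buffer space as the window."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffered = 0

    def window(self):
        """Free space, which is the value placed in the window advertisement."""
        return self.capacity - self.buffered

    def on_segment(self, nbytes):
        """Accept what fits; return (bytes accepted, window for the ACK)."""
        accepted = min(nbytes, self.window())
        self.buffered += accepted
        return accepted, self.window()

    def on_app_read(self, nbytes):
        """The application drains data; return the reopened window."""
        self.buffered -= min(nbytes, self.buffered)
        return self.window()


rx = ReceiveBuffer(4000)
trace = [rx.on_segment(1000), rx.on_segment(3000), rx.on_segment(500)]
reopened = rx.on_app_read(2000)
```

After the second segment the advertised window is 0 and the transmitter must stop; once the application reads 2000 bytes, the window reopens to 2000.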
182 TCP Sliding Window Algorithm
Flow control uses the window concept: the window specifies an acceptable range of sequence numbers.

183 Sliding Windows
[Figure: Advertised Window, Available Window]

184 Sliding Window: Details
[Figure: sender side with the markers Max ACK received and Next seqnum and the regions Sent & Acked, Sent Not Acked, OK to Send, Not Usable; receiver side with the markers Next expected and Max acceptable and the regions Received & Acked, Acceptable Packet, Not Usable]

185 Window Flow Control: Header
[Figure: TCP headers of the packet received and the packet sent, each showing Source Port, Dest. Port, Sequence Number, Acknowledgment, HL/Flags, Window, D. Checksum, Urgent Pointer, Options; the data buffer is marked App write, acknowledged, sent, to be sent, outside window]

186 Sliding Windows Example - A Dynamic Parameter

187 Window Size – A Dynamic Parameter

188 How To Determine Optimal Window Size (Keeping the Pipe Full): The Bandwidth-Delay Product (BDP)

189 The TCP Window Scaling option
The original TCP specification allowed a window size no larger than 64 KB. This limitation comes from the 16-bit header field that specifies the window size. To achieve the recommended 1 MB window size, TCP extensions must be enabled to add up to another 14 bits to the window size, making a total of 30 bits for the window size.
The TCP window scaling option works by including a scale factor in the TCP OPTIONS field of a SYN packet. This scale factor informs the receiver that the sender is willing to do window scaling and offers a scale factor for the communication. The scale factor is used to shift the window field before the data segment is sent. It's important to note that the window size carried in the 3-way handshake itself is NOT scaled. Scaling first applies to the window size in the first data packet sent after the 3-way handshake. If there is a scaling factor, the initial window size of 65,535 bytes is always used.
The window size is then multiplied by the scaling factor identified in the 3-way handshake. The table below represents the scaling factor boundaries for various window sizes.

190 The TCP Window Scaling Option

Scale factor   Scale Value   Initial Window   Window Scaled
0              1             65535 or less    65535 or less
1              2             65535            131,070
2              4             65535            262,140
3              8             65535            524,280
4              16            65535            1,048,560
5              32            65535            …
6              64            65535            …
…              …             …                …
14             16384         65535            1,073,725,440

191 TCP Options – Window Scale Factor

192 The TCP Receiver Silly Window Syndrome
Problems associated with "shrinking" the TCP window when the receiver is much slower than the sender (the example below shows a receiver that can only process 1 out of 3 received packets).
The diagram shows one example of how the phenomenon known as TCP silly window syndrome (SWS) can arise. The client is trying to send data as fast as possible to the server, which is very busy and cannot clear its buffers promptly. Each time the client sends data, the server reduces its receive window. The size of the messages the client sends shrinks until it is only sending very small, inefficient segments.

193 Receiver SWS Avoidance
Let's start with SWS avoidance by the receiver. As we saw in the initial example above, the receiver contributed to SWS by reducing the size of its receive window to smaller and smaller values because it was busy. This caused the right edge of the sender's send window to move by ever-smaller increments, leading to smaller and smaller segments. To avoid SWS, we simply make the rule that the receiver may not update its advertised receive window in a way that leaves the sender with too little usable window space. In other words, we restrict the receiver from moving the right edge of the window by too small an amount. The usual minimum by which the edge may be moved is either the value of the MSS parameter or one-half the buffer size, whichever is less. Let's see how we might use this in the example above.
When the server receives the initial 360-byte segment from the client and can only process 120 bytes, it does not reduce the window size to 120. It reduces it all the way to 0, closing the window. It sends this back to the client, which then stops instead of sending a small segment. Once the server has removed 60 more bytes from the buffer, it has 180 bytes free, half the size of the buffer. It now opens the window up to 180 bytes and sends the new window size to the client. It will continue to advertise only 0 bytes or 180 or more, never a smaller value in between. This seems to slow down the operation of TCP, but it really doesn't. Because the server is overloaded, the limiting factor in the overall performance of the connection is the rate at which the server can clear the buffer. We are just exchanging many small segments for a few larger ones.

194 Tinygram Congestion Prevention: The Nagle algorithm (Sender SWS)
Nagle’s algorithm, named after John Nagle, improves the efficiency of TCP/IP networks by reducing the number of packets that need to be sent over the network.
Telnet example: a test over a long-haul link with a 5-second round-trip time. The user types 25 characters, one every 200 ms. Without any mechanism to prevent small-packet (tinygram) congestion, 25 new packets would be sent in 5 seconds, and each acknowledgment could additionally be held back by up to 200 ms by the delayed ACK algorithm. Each packet carries 1 byte of data behind 40 bytes of headers (20 bytes for the IP header, 20 bytes for the TCP header), so the amount of data sent is 41 x 25 = 1025 bytes. Overhead here is 4000%.
With the Nagle algorithm, however, the first character from the user is sent immediately. The next 24 characters, arriving from the user at 200 ms intervals, are queued. When the ACK for the first packet arrives at the end of 5 seconds, a single packet with the 24 queued characters is sent, i.e. 40 x 2 + 25 = 105 bytes are sent in total. Overhead is only 320%, with no penalty in response time.
The Nagle algorithm is useful on a slow WAN when it is desired to reduce tinygram congestion.
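The overhead figures in the Telnet example can be checked with a few lines of arithmetic. This is only a sketch: `HEADER_BYTES` and `overhead_percent` are illustrative names, and overhead is taken to mean header bytes divided by data bytes.

```python
HEADER_BYTES = 40  # 20-byte IP header + 20-byte TCP header per packet


def total_bytes(payload_sizes):
    """Bytes on the wire for a list of per-packet payload sizes."""
    return sum(payload_sizes) + HEADER_BYTES * len(payload_sizes)


def overhead_percent(payload_sizes):
    """Header bytes as a percentage of the data bytes actually delivered."""
    headers = HEADER_BYTES * len(payload_sizes)
    return 100 * headers / sum(payload_sizes)


without_nagle = [1] * 25      # 25 keystrokes, one packet each
with_nagle = [1, 24]          # first keystroke alone, then 24 queued ones

print(total_bytes(without_nagle), overhead_percent(without_nagle))
print(total_bytes(with_nagle), overhead_percent(with_nagle))
```

Both cases deliver the same 25 bytes of data; only the number of packets, and with it the header cost, changes.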
Sometimes the Nagle algorithm needs to be turned off. For example, with an X Window System server, small messages (mouse movements) must be delivered without delay to provide real-time feedback to the interactive user.

195 Key TCP Concepts
Modern TCP implementations incorporate a set of SWS avoidance algorithms. When receiving, devices are programmed not to advertise very small windows, waiting instead until there is enough room in the buffer for one of a reasonable size. Transmitters use Nagle’s algorithm to ensure that small segments are not generated when there are unacknowledged bytes outstanding.

196 TCP Timers
The Retransmission Timer
The retransmission timer manages retransmission timeouts (RTOs), which occur when a preset interval between the sending of a datagram and the returning acknowledgment is exceeded. The value of the timeout tends to vary, depending on the network type, to compensate for speed differences. If the timer expires, the datagram is retransmitted with an adjusted RTO, which is usually increased exponentially up to a preset maximum. If the maximum is exceeded, connection failure is assumed, and error messages are passed back to the upper-layer application. Values for the timeout are determined by measuring the time that data takes to be transmitted to another machine and the acknowledgment received back, which is called the round-trip time, or RTT. These measured RTTs are averaged by a formula that develops an expected value, called the smoothed round-trip time, or SRTT. This value is then increased to account for unforeseen delays.
The Delayed ACK Timer
TCP uses delayed acknowledgments to reduce the number of packets that are sent on the media. Instead of sending an acknowledgment for each TCP segment received, TCP takes a common approach to implementing delayed acknowledgments.
As data is received by TCP on a particular connection, it sends an acknowledgment back only if one of the following conditions is true: No acknowledgment was sent for the previous segment received. A segment is received, but no other segment arrives within 200 milliseconds for that connection. 197 TCP Timers (Continued) The Quiet Timer After a TCP connection is closed, it is possible for datagrams that are still making their way through the network to attempt to access the closed port. The quiet timer is intended to prevent the just-closed port from reopening again quickly and receiving these last datagrams. The quiet timer is usually set to twice the maximum segment lifetime (the same value as the Time to Live field in an IP header), ensuring that all segments still heading for the port have been discarded. Typically, this can result in a port being unavailable for up to 30 seconds, prompting error messages when other applications attempt to access the port during this interval. The Keep-Alive Timer and the Idle Timer Both the keep-alive timer and the idle timer were added to the TCP specifications after their original definition. The keep-alive timer sends an empty packet at regular intervals to ensure that the connection to the other machine is still active. If no response has been received after sending the message by the time the idle timer has expired, the connection is assumed to be broken. The keep-alive timer value is usually set by an application, with values ranging from 5 to 45 seconds. The idle timer is usually set to 360 seconds. 198 RTT and TCP Retransmission Timeouts (RTO) The RTO is typically calculated based on the RTT The original TCP specification had TCP update a smoothed RTT estimator (called R) using the low-pass filter R <- aR + (1-a)M, M is the measured RTT where a is a smoothing factor with a recommended value of 0.9. This smoothed RTT is updated every time a new measurement is made. 
Ninety percent of each new estimate is from the previous estimate and 10% is from the new measurement. Given this smoothed estimator, which changes as the RTT changes, RFC 793 recommended the retransmission timeout value (RTO) be set to RTO = Rb, where b is a delay variance factor with a recommended value of 2.

199 TCP Performance Enhancement - Congestion Avoidance Using Slow Start
cwnd = congestion window
Some implementations recommend cwnd = 4 as the initial value to improve throughput

200 The TCP Delayed ACK timer
Delayed ACK timer = 200 msec by default

201 TCP Reliability – Use of Acknowledgement
TCP is responsible for data recovery by providing a sequence number with each packet that it sends
TCP requires an ACK (acknowledgement) to ensure that correct data is received
A packet can be retransmitted if an error is detected
Three ACK schemes are available; they are discussed at length later

202 TCP Acknowledgement Schemes
Stop-and-Wait: an individual ACK is required for each segment received
Go-Back-N: a cumulative ACK is required for consecutive segments received
Selective-Ack: the receiver reports which segments arrived, so that only the segments in error within a run of consecutive segments are retransmitted

203 Stop-And-Wait ACK Scheme: TCP Interactive Data Flow
Interactive Data: character/echo

205 Delayed Acknowledgements
Delayed ACK timer = 200 msec by default

206 Regular Acknowledgements

207 Stop-And-Wait ACK Scheme – Loss Packet Scenario

208 Stop-And-Wait ACK Scheme – Loss ACK Scenario

209 Performance of Stop-And-Wait ACK Scheme
It works, but performance stinks
Example: 1 Gbps link, 15 ms end-to-end propagation delay (so RTT = 30 ms), 1 KB packet:
$T_{transmit} = \frac{L \text{ (packet length in bits)}}{R \text{ (transmission rate, bps)}} = \frac{8 \text{ kb/pkt}}{10^{9} \text{ b/sec}} = 8 \text{ microsec}$
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
$U_{sender}$: utilization, the fraction of time the sender is busy sending
1 KB pkt every 30 msec -> 33 kB/sec throughput over a 1 Gbps link
The network protocol limits the use of physical resources!
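The stop-and-wait numbers above can be reproduced with a short calculation. This is a sketch using the slide's own inputs (1 KB = 8000 bits, 1 Gbps, RTT = 30 ms); the function and variable names are illustrative only.

```python
def stop_and_wait_utilization(packet_bits: float, rate_bps: float, rtt_s: float) -> float:
    """Fraction of time the sender is busy when it sends one packet per round trip."""
    t_transmit = packet_bits / rate_bps          # L / R
    return t_transmit / (rtt_s + t_transmit)     # (L/R) / (RTT + L/R)


L_BITS = 8_000        # 1 KB packet
R_BPS = 1e9           # 1 Gbps link
RTT_S = 0.030         # 2 x 15 ms propagation delay

t_transmit = L_BITS / R_BPS
utilization = stop_and_wait_utilization(L_BITS, R_BPS, RTT_S)
throughput_bytes_per_s = (L_BITS / 8) / (RTT_S + t_transmit)

print(f"T_transmit  = {t_transmit * 1e6:.0f} microsec")
print(f"U_sender    = {utilization:.5f}")
print(f"throughput ~= {throughput_bytes_per_s / 1000:.0f} kB/sec")
```

The result is dominated by the RTT term: making the link faster barely changes the throughput, which is why the later slides turn to pipelining.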
210 Ambiguous ACK – Loss ACK Scenario

211 Ambiguous Acknowledgement - The Karn Algorithm
Rule 1: Ignore the measured RTT for retransmitted packets. This removes ambiguity from RTT measurements.
Rule 2: The RTO should be doubled after a retransmission. This is called "Exponential Back-off".
A problem occurs when a packet is retransmitted. Say a packet is transmitted, a timeout occurs, the RTO is backed off, the packet is retransmitted with the longer RTO, and an acknowledgment is received. Is the ACK for the first transmission or the second? This is called the retransmission ambiguity problem. [Karn and Partridge 1987] specify that when a timeout and retransmission occur, we cannot update the RTT estimators when the acknowledgment for the retransmitted data finally arrives. This is because we don't know to which transmission the ACK corresponds. (Perhaps the first transmission was delayed and not thrown away, or perhaps the ACK of the first transmission was delayed.) Also, since the data was retransmitted and the exponential backoff has been applied to the RTO, we reuse this backed-off RTO for the next transmission. Don't calculate a new RTO until an acknowledgment is received for a segment that was not retransmitted.

212 TCP Fast Retransmission
Fast Retransmission: When TCP detects segment loss using the retransmission timer, ssthresh is set to half of CWND (ssthresh = CWND / 2) and CWND is set to one full-size segment. Instead of waiting for the retransmission timer to expire, TCP can detect packet loss by looking for packet reordering and retransmit the lost packet. This scheme is called "Fast Retransmit". In this algorithm, the TCP receiver sends an immediate duplicate ACK when an out-of-order segment arrives. The TCP at the other end deduces from a small number (normally 3) of consecutive duplicate ACKs that a segment has been lost, and deduces the starting sequence number of the missing segment. The missing segment is then retransmitted.
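Karn's two rules, combined with the original smoothed estimator R <- aR + (1-a)M and RTO = Rb, can be sketched as below. This is a minimal illustration, not a real TCP implementation: the `RtoEstimator` class, its method names, the 64-second `max_rto` cap, and the sample RTT values are all assumptions made for the example.

```python
class RtoEstimator:
    """Smoothed RTT estimator with Karn's algorithm (illustrative sketch)."""

    def __init__(self, initial_rtt: float, alpha: float = 0.9,
                 beta: float = 2.0, max_rto: float = 64.0):
        self.alpha = alpha          # smoothing factor a
        self.beta = beta            # delay variance factor b
        self.max_rto = max_rto      # assumed upper limit for the back-off
        self.srtt = initial_rtt     # smoothed RTT estimate R
        self.rto = min(beta * initial_rtt, max_rto)

    def on_ack(self, measured_rtt: float, was_retransmitted: bool) -> None:
        if was_retransmitted:
            # Rule 1: the sample is ambiguous, so ignore it and keep
            # the backed-off RTO for the next transmission.
            return
        # R <- aR + (1-a)M, then RTO = Rb
        self.srtt = self.alpha * self.srtt + (1 - self.alpha) * measured_rtt
        self.rto = min(self.beta * self.srtt, self.max_rto)

    def on_timeout(self) -> None:
        # Rule 2: exponential back-off, doubling the RTO up to the cap.
        self.rto = min(self.rto * 2, self.max_rto)


est = RtoEstimator(initial_rtt=1.0)
history = [est.rto]
est.on_ack(2.0, was_retransmitted=False)   # clean sample updates R and RTO
history.append(est.rto)
est.on_timeout()                           # timeout doubles the RTO
history.append(est.rto)
est.on_ack(0.5, was_retransmitted=True)    # ambiguous ACK: nothing changes
history.append(est.rto)
est.on_ack(1.1, was_retransmitted=False)   # clean sample recomputes the RTO
history.append(est.rto)
print([round(r, 2) for r in history])
```

Note how the ambiguous 0.5-second sample leaves both R and the backed-off RTO untouched; a fresh RTO is computed only once a segment that was not retransmitted is acknowledged.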
213 Go-Back-N ACK Scheme: TCP Bulk Data Flow

Motivation: Pipelining Increases Utilization
[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last bit transmitted, t = L / R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L / R]
Increase utilization by a factor of 3:
$U_{sender} = \frac{3 \cdot L/R}{RTT + L/R} = \frac{0.024}{30.008} = 0.0008$

215 Go-Back-N (GBN) - Sender
Sender:
k-bit seq # in pkt header
“sliding window” of up to N consecutive unack’ed pkts allowed
ACK(n): ACKs all pkts up to and including seq # n - “cumulative ACK”
may receive duplicate ACKs (see receiver)
The local node must also keep a buffer of all PDUs which have been sent, but have not yet been acknowledged
Timer for each in-flight pkt
Timeout(n): retransmit pkt n and all higher seq # pkts in window

216 GBN: Receiver
ACK-only: always send ACK for the correctly-received pkt with the highest in-order seq #
may generate duplicate ACKs
need only remember expectedseqnum
out-of-order pkt: discard (don’t buffer) -> no receiver buffering!
Re-ACK pkt with highest in-order seq #

217 GBN In Action

218 GBN – How To Handle Packet Loss
Example of Go-Back-N. The sender in this example transmits four PDUs (1-4) and the first one (1) of these is not successfully received. The receiver notes that it was expecting a PDU numbered 1 and actually receives a PDU numbered 2. It therefore deduces that (1) was lost. It requests retransmission of the missing PDU by sending a Go-Back-N request (in this case N=1), and discards all received PDUs with a number greater than 1. The sender receives the Go-Back-N request and retransmits the missing PDU (1), followed by all subsequently sent PDUs (2-4), which the receiver then correctly receives and acknowledges.
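The packet-loss example above can be replayed with a small simulation. It is a deliberately simplified, round-based sketch: the sender transmits a full window, then acts on the cumulative acknowledgment, whereas a real Go-Back-N sender slides continuously and uses timers. The function name `go_back_n` and the `lose_first_tx` parameter are hypothetical.

```python
def go_back_n(num_pkts: int, window: int, lose_first_tx: set):
    """Simulate Go-Back-N in rounds.

    lose_first_tx holds the sequence numbers whose FIRST transmission is lost.
    Returns (sent_log, delivered): every packet put on the wire, in order,
    and the packets the receiver accepted, in order.
    """
    sent_log, delivered = [], []
    already_lost = set()
    base = 1                                   # oldest unacknowledged packet
    while base <= num_pkts:
        expected = base                        # receiver's expectedseqnum
        for seq in range(base, min(base + window, num_pkts + 1)):
            sent_log.append(seq)
            if seq in lose_first_tx and seq not in already_lost:
                already_lost.add(seq)          # lost in the network
                continue
            if seq == expected:
                delivered.append(seq)          # in order: accept and ACK
                expected += 1
            # otherwise out of order: the receiver discards it (no buffering)
        # The cumulative ACK moves the window; if nothing was accepted,
        # the sender goes back and resends from the same base.
        base = expected
    return sent_log, delivered


sent, got = go_back_n(num_pkts=4, window=4, lose_first_tx={1})
print(sent)
print(got)
```

With PDU 1 lost, PDUs 2-4 are sent twice even though they arrived intact the first time, which is the inefficiency that selective acknowledgment is meant to address.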
219 GBN: ACK of Multiple Segments 220 GBN: ACK of Multiple Segments – Slight Differences 221 GBN: Fast Sender, Slow Receiver 222 Limitations With Stop-And-Wait and Go-Back-N Ack Schemes Poor performance when multiple packets are lost from a window of data Cumulative ACKs provide limited information Aggressively re-transmitting packets is inefficient Selective-ACK solves these problems 223