Networks and Protocols (320301)
Lecture Notes Fall 2005
Jürgen Schönwälder
August 27, 2005
School of Engineering and Science
International University Bremen

Preface

The lecture “Networks and Protocols” is an introduction to the foundations of packet switched data communication networks. The lecture covers widely deployed Internet technologies and the IEEE 802 standards for local area networks.

The material covered in this lecture has been selected to focus on widely deployed technologies and protocols. This approach has the disadvantage of ignoring some interesting alternative technologies which did not become widely deployed for whatever (often non-technical) reason. However, the advantage of this approach is that more time can be spent discussing the selected technologies in some detail and enabling and encouraging students to experiment with their own network infrastructure. This usually increases both the understanding of the material and the motivation.

Some parts of the lecture notes date back to a lecture called “Introduction to Operating Systems and Networks” which I have given at the Technical University Braunschweig. These notes were later heavily revised and extended for a lecture called “Computer Networks” which I have given at the University of Osnabrück.

Some parts of these lecture notes are heavily influenced by standard text books such as [1, 2, 3, 4, 5, 6, 7] while other parts are directly derived from the relevant standards. Students who want to understand the discussed protocols in even more detail are strongly encouraged to read the relevant parts of the standards which are referenced throughout the text.

My thanks go to the many students who asked critical questions and provided constructive feedback which improved the presentation and reduced the number of errors and inconsistencies.

Jürgen Schönwälder

Contents

1 Introduction
  1.1 Fundamental Concepts
    1.1.1 Services
    1.1.2 Protocols
    1.1.3 Names and Addresses
    1.1.4 ISO/OSI Reference Model
    1.1.5 Internet Reference Model
  1.2 Standardization
    1.2.1 ISO Standardization
    1.2.2 Internet Standardization
    1.2.3 IEEE Standardization

2 IEEE 802 Local Area Networks
  2.1 Logical Link Control (IEEE 802.2)
  2.2 Ethernet (IEEE 802.3)
    2.2.1 Physical Layer (PHY)
    2.2.2 Medium Access Layer (MAC)
    2.2.3 Fast-Ethernet (IEEE 802.3u)
    2.2.4 Gigabit Ethernet (IEEE 802.3z/802.3ab)
    2.2.5 10 Gigabit Ethernet (IEEE 802.3ae)
  2.3 Wireless LANs (IEEE 802.11)
  2.4 Bluetooth LANs (IEEE 802.15)
  2.5 Port Access Control (IEEE 802.1X)
  2.6 Bridges
    2.6.1 Source Routing Bridges
    2.6.2 Transparent Bridges (IEEE 802.1D)
  2.7 Virtual LANs (IEEE 802.1Q)
  2.8 LAN Priorities (IEEE 802.1D)

3 Internet Network Layer
  3.1 Fundamentals
    3.1.1 Evolution of the Internet
    3.1.2 Internet Design Principles
    3.1.3 Basic Terminology
    3.1.4 Autonomous Systems
    3.1.5 Internet Address Scopes
  3.2 Internet Protocol Version 4 (IPv4)
    3.2.1 IPv4 Addresses
    3.2.2 IPv4 Packet Format
    3.2.3 IPv4 Forwarding
    3.2.4 IPv4 Error Handling (ICMPv4)
    3.2.5 MTU Path Discovery
    3.2.6 IPv4 over IEEE 802.3
    3.2.7 IPv4 Address Translation (ARP, RARP)
    3.2.8 Automatic Configuration (DHCP)
  3.3 Internet Protocol Version 6 (IPv6)
    3.3.1 IPv6 Addresses
    3.3.2 IPv6 Packet Format
    3.3.3 IPv6 Extensions
    3.3.4 IPv6 Forwarding
    3.3.5 IPv6 Error Handling (ICMPv6)
    3.3.6 IPv6 over IEEE 802.3
    3.3.7 IPv6 Neighbor Discovery
  3.4 Routing Protocols
    3.4.1 Routing Information Protocol (RIP)
    3.4.2 Open Shortest Path First (OSPF)
    3.4.3 Border Gateway Protocol (BGP)

4 Internet Transport Layer
  4.1 Pseudo Header
  4.2 User Datagram Protocol (UDP)
  4.3 Transmission Control Protocol (TCP)
    4.3.1 Connection Establishment
    4.3.2 Connection Tear-down
    4.3.3 State Machine
    4.3.4 Flow Control
    4.3.5 Congestion Control
    4.3.6 Retransmission Timer
  4.4 Stream Control Transmission Protocol (SCTP)
  4.5 Datagram Congestion Control Protocol (DCCP)

5 Internet Application Layer
  5.1 Domain Name System (DNS)
    5.1.1 Format of Domain Names
    5.1.2 Resource Records
    5.1.3 DNS Message Formats
  5.2 Abstract Syntax Notation One (ASN.1)
    5.2.1 Basic Concepts
    5.2.2 ISO Registration Tree
    5.2.3 Primitive ASN.1 Data Types
    5.2.4 Constructed ASN.1 Data Types
    5.2.5 Restrictions on ASN.1 Data Types
    5.2.6 ASN.1 Tags
    5.2.7 Example ASN.1 Definition
    5.2.8 Basic Encoding Rules (BER)
    5.2.9 Generic String Encoding Rules
  5.3 Simple Network Management Protocol (SNMP)
    5.3.1 Foundations
  5.4 Augmented Backus-Naur Form (ABNF)
    5.4.1 Rule Names, Comments and Terminal Symbols
    5.4.2 Operators
    5.4.3 Core Definitions
    5.4.4 ABNF in ABNF
  5.5 Simple Mail Transfer Protocol (SMTP)
    5.5.1 Fundamentals
    5.5.2 Commands and Responses
    5.5.3 Message Headers
    5.5.4 Multipurpose Internet Mail Extensions (MIME)
  5.6 Internet Message Access Protocol (IMAP)
    5.6.1 Identification and States
    5.6.2 States
    5.6.3 Commands
    5.6.4 Tagging
    5.6.5 Message Format
  5.7 File Transfer Protocol (FTP)
  5.8 Hypertext Transfer Protocol (HTTP)
    5.8.1 Persistent Connections and Pipelining
    5.8.2 Caching and Proxies
    5.8.3 Negotiation
    5.8.4 Conditional Requests
    5.8.5 Delta Encoding
    5.8.6 HTTP as a Substrate

A Packet Capturing
  A.1 BSD Packet Filter (BPF)
  A.2 libpcap
  A.3 jpcap

B Sockets
  B.1 Socket Addresses
  B.2 Communication Kinds
  B.3 Socket API Overview
  B.4 Name Resolution
  B.5 Multiplexing

Chapter 1
Introduction

1.1 Fundamental Concepts

This section discusses some architectural concepts and introduces some basic terms. A very fundamental approach to dealing with complex systems is to divide them into sub-systems with well-defined interfaces between those subsystems.
Accordingly, networks are usually designed as layered systems where each layer is responsible for providing a certain function. The layering principle is a very fundamental principle for structuring communication systems.

1.1.1 Services

The abstract set of functions provided by a given network is called the service realized by the network. The service provided by a network is usually defined in abstract terms to allow for multiple concrete programming interfaces.

• A service is used by invoking one or more service primitives. Typical ISO/OSI service primitives are:
  – Request of a service (request)
  – Indication of the request of a service (indication)
  – Response to the request of a service (response)
  – Confirmation of the requested service (confirmation)
• The interface which is used to access the service primitives is called a service access point.
• Services are realized by so-called (protocol) instances. The instances at layer N of a layered system are accordingly called N-instances.
• Strict layering requires that N-instances may only use services realized by (N − 1)-instances to realize the layer N service.

1.1.2 Protocols

A protocol is a set of functions which together realize a well-defined communication service (e.g., error-free ordered transmission of data from a sender to a receiver).

• Protocols define the format and the semantics of the protocol data units (PDUs) exchanged between communicating parties.
• Protocols specifically define the rules which have to be followed when creating or processing protocol data units.
• The instantiation of a protocol at runtime is realized by a so-called protocol instance. In the general case, it is possible to have multiple instances of a protocol running concurrently on a single system.
• Specialized protocols have been developed for various application domains. Note that a certain service can be realized by multiple different protocols.
• The specification of protocols can be either informal (plain English text) or formal. Formal protocol specifications usually use specification languages such as Lotos, Estelle or SDL which have been developed for this purpose.

1.1.3 Names and Addresses

Protocol instances are usually identified by some sort of an address. Addresses for protocol instances that only exist once on a certain system are often also used to identify whole systems. In addition, more human friendly names are often used and mapped to addresses as needed.

• A human friendly name is an identification of a system or protocol instance which is relatively easy for humans to read and memorize. Well known examples are Internet domain names such as www.iu-bremen.de.
• Names often have variable length and the name space is usually structured hierarchically.
• Addresses on the other hand are identifications of protocol instances which are optimized for machine processing. A typical example are Internet Protocol (IP) addresses such as 212.201.48.1.
• Addresses usually have a fixed length and are relatively compact since they are frequently transmitted.

ISDN Addresses

The Integrated Services Digital Network (ISDN) is the digital telecommunication network which is widely available in Europe. ISDN addresses which identify telecommunication equipment such as phones are structured according to the E.164 numbering plan defined by the ITU:

• An international ISDN phone number consists of a maximum of 15 digits. The first digits contain the country code, followed by a national region code, followed by the phone number within that region.
• An international ISDN phone number can be followed by a target identifier which is up to 40 digits long.
• The common notation for international ISDN phone numbers starts with a + symbol followed by digits which can be grouped into blocks by using white space or other separator characters. An example would be +49 241 200 3587.
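The notation rules for international ISDN phone numbers listed above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function name normalize_e164 and the exact set of accepted separator characters (white space, hyphen, dot, slash, parentheses) are choices made for this example and are not taken from the E.164 numbering plan itself.

```python
import re

MAX_DIGITS = 15  # maximum number of digits stated in the text above


def normalize_e164(number: str) -> str:
    """Return the number as '+' followed by digits only.

    Assumption: white space, '-', '.', '/', '(' and ')' are treated as
    the "other separator characters" mentioned in the text.
    """
    text = number.strip()
    if not text.startswith("+"):
        raise ValueError("an international number starts with '+'")
    digits = re.sub(r"[\s\-./()]", "", text[1:])
    if not re.fullmatch(r"[0-9]+", digits):
        raise ValueError("unexpected character in phone number")
    if len(digits) > MAX_DIGITS:
        raise ValueError("more than 15 digits")
    return "+" + digits


# The example number used in the text.
print(normalize_e164("+49 241 200 3587"))
```

Stripping the separators first keeps the length check independent of how a number happens to be grouped into blocks.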
Internet Addresses

Internet network layer addresses have a fixed size. Depending on the protocol version (IPv4 or IPv6), these addresses are either 4 bytes or 16 bytes long.

• Four byte IPv4 addresses are typically written as four decimal numbers separated by dots where every decimal number represents one byte (dotted quad notation). A typical example is the IPv4 address 212.201.48.1.
• Sixteen byte IPv6 addresses are typically written as a sequence of hexadecimal numbers separated by colons (:) where every hexadecimal number represents two bytes. Leading zeros can be omitted and two consecutive colons can represent a sequence of zero-valued numbers. For example, the IPv6 address 1080:0:0:0:8:800:200C:417A can be written somewhat shorter as 1080::8:800:200C:417A. IPv6 addresses which contain IPv4 addresses can be written by using the dotted quad notation for the IPv4 address portion. For example, the IPv6 address 0:0:0:0:0:0:0D01:4403 can be written as ::0D01:4403 as well as ::13.1.68.3.

Further details about IPv6 addresses can be found in RFC 3513 [8]. A more compact representation of IPv6 addresses can be found in RFC 1924 [9] (recommended reading). In this context, RFC 1925 [?] is also highly recommended background reading material.

IEEE 802 MAC Addresses

IEEE 802 addresses, sometimes also called MAC addresses, are usually 6 bytes or 48 bits long. (There are also 2 byte or 16 bit IEEE 802 addresses, which however do not play a significant role.)

• The common notation for IEEE 802 addresses is a sequence of hexadecimal numbers (one number for each address byte) where the numbers are separated from each other using colons or hyphens. Typical examples are 00:D0:59:5C:03:8A or 00-D0-59-5C-03-8A.
• The first bit of an IEEE 802 address in transmission order (the least significant bit of the first byte in the common notation) indicates whether it is a normal unicast address (0) or a multicast address (1). The broadcast address, which represents all stations within a broadcast domain, consists of 48 one bits.
• The second bit of an IEEE 802 address in transmission order (the second least significant bit of the first byte in the common notation) defines whether it is a local address (1) or a global address (0). A local address is assigned administratively and is only unique within this administrative region, while global addresses are globally unique.
• Globally unique IEEE 802 addresses are created by vendors who have to apply to the IEEE for a number space. The vendor then assigns unique numbers taken from the address space delegated to it. It is thus possible to identify the vendor of a network device by looking up the vendor code (the first three bytes) in a number space delegation list.

Internet Domain Names

Internet addresses are optimized for machine processing and storage and not necessarily for human memory. This led to the introduction of names which are more oriented towards the requirements of human beings.

[Figure 1.1: Structure of Domain Name System (DNS) Names (a tree with a virtual root; top-level names such as nl, de, edu, org, net, com and biz; the 2nd level name iu-bremen; 3rd level names www and eecs; and a 4th level name www)]

• The Domain Name System (DNS) defines a distributed hierarchical name space which in particular supports the delegation of name assignments.
• In many cases, the structure of the DNS name space reflects the organizational structure of the organization which maintains the relevant part of the DNS name space.
• When using DNS names to refer to a node on the Internet, a process called name resolution is performed which translates the DNS name to one or more IP addresses.
• The traditional and widely deployed DNS does not support internationalized domain names. A special encoding has therefore been defined recently to support internationalized domain names without any changes to the DNS infrastructure.

1.1.4 ISO/OSI Reference Model

The ISO/OSI Reference Model is the classic layered model for communication networks which was developed during the ISO work on the Open Systems Interconnection (OSI). Real networks usually do not strictly follow the seven layer OSI model.
Physical Layer

• Transmission of a sequence of bits over a transmission medium.
• Definition of the properties of the physical media.
• Representation of the binary values 0 and 1 (e.g., voltages, frequencies).
• Synchronization between sender and receiver.
• Definition of standards for connectors and sockets.

[Figure 1.2: ISO/OSI reference model (two end systems, each with an application process on top of the application, presentation, session, transport, network, data link and physical layers, connected through a transit system which implements the network, data link and physical layers; the upper three layers form the application system, the lower four layers the transport system)]

Data Link Layer

• Transmission of larger bit sequences in so-called frames.
• Data transfer between systems connected to the same medium.
• Detection and correction of transmission errors.
• Flow control to adapt the speed between senders and receivers.
• Realization usually in hardware.

Network Layer

• Determination of paths through a complex communication network.
• Multiplexing of end system connections over intermediate systems.
• Error detection and correction between sending and receiving network nodes.
• Flow and congestion control between end systems.
• Transmission of datagrams or packets in packet switched networks.

Transport Layer

• End-to-end communication channels between applications.
• Virtual connections over connection-less datagram services in packet switched networks.
• Error detection and correction between transport layer endpoints.
• Flow and congestion control between transport layer endpoints.

Session Layer

• Synchronization and coordination of communicating processes.
• Interaction control (check points).

Presentation Layer

• Harmonization of different data representations.
• Serialization of complex data structures.
• Data compression.

Application Layer

• Realization of fundamental application oriented services.
• Examples: terminal emulation, management of name spaces, data base access, network management, electronic messaging systems, process and machine control, ...

1.1.5 Internet Reference Model

[Figure 1.3: Internet reference model (two end systems, each with an application process on top of the application, transport, Internet and subnetwork layers, connected through a transit system which implements the Internet and subnetwork layers)]

• The Internet has been designed as a network which can be implemented on top of almost any other communication network by making very few assumptions about the services provided by the underlying communication networks. Accordingly, the layer below the Internet layer (which basically corresponds to the network layer of the OSI reference model) is called a subnetwork (see RFC 1149 [10] for an interesting example of a subnetwork).
• The Internet Protocol (IP) provides a common basis which makes it possible to cross the boundaries imposed by various other network technologies.
• The Internet Protocol can of course also be used as a subnetwork technology, which naturally leads to so-called IP tunnels.
• There are currently two protocols on the Internet network layer. The currently widely deployed IP protocol is version 4 (IPv4). The IP protocol version 6 (IPv6) is slowly gaining deployment and practical importance.
• Internet protocols are often designed to simplify implementations (usually in software, even though high-speed devices implement many protocols in hardware).
• The Internet protocols are primarily designed for data communication (asynchronous, best-effort) and only recent work tries to support voice and multi-media communication (isochronous traffic and quality of service).
• Implementations of many Internet protocols are freely available, which helps to transfer the protocols from research/development into actual products.
Universities and research labs traditionally play a big role as a melting pot and experimentation field for new protocols.

1.2 Standardization

The standardization of protocols creates unified network architectures supporting open (that is, vendor independent) communication. Vendor specific protocols and architectures (e.g., SNA or DECnet) have lost importance.

[Figure 1.4: Theory of Standardization (activity plotted over time; labels: research, time for standardization, investment)]

Standardization itself is a complicated and in most cases a time consuming and thus expensive process. However, once an open standard has been established, it can create an open competitive market which leads to the development of high-quality products which are usually available at very reasonable prices. However, only a very small fraction of the developed standards are actually successful in terms of wide-spread deployment:

• The success of a standard must be measured by the number of actually deployed interoperable implementations.
• Standards must allow vendors to differentiate their products.
• Successful standards create an open market for new products.
• One critical factor for the success of a standards activity is the timing.

There are many organizations which develop standards for communication networks. The most important organizations and their standards processes are briefly introduced in the following subsections.

1.2.1 ISO Standardization

• The International Organization for Standardization (ISO) is an organization for establishing international standards. ISO standards cover a wide spectrum of things, such as paper sizes or screws. Note that the abbreviation ISO stems from the Greek word isos, meaning “equal”.
• ISO is a network of the national standards institutes of almost 150 countries, on the basis of one member per country (ANSI for the USA, DIN for Germany), with a Central Secretariat in Geneva, Switzerland, that coordinates the system.
• The ISO standardization process distinguishes three states:
  1. Draft Proposal (DP)
  2. Draft International Standard (DIS)
  3. International Standard (IS)
  The transition between these states requires majorities during voting processes and transitions can be repeated multiple times.
• Standards are identified by numbers. Different revisions of the same standard are published under the same number. To distinguish the revisions, the year of the publication is usually appended to the number of a standard.
• The Open Systems Interconnection (OSI) maintains the standards which deal with communication in open (communication) systems.

1.2.2 Internet Standardization

• The Internet Engineering Task Force (IETF) is responsible for the standardization of the Internet protocols (RFC 3233 [11], RFC 2026 [12]).
• Internet standards are usually developed by working groups (WGs) which are organized in different areas (e.g., routing or transport).

[Figure 1.5: Internet standardization process model (Working Group Document (Internet Draft), Proposed Standard (RFC), Draft Standard (RFC), Internet Standard (RFC); each of the three RFC states can also lead to Historic)]

• Every area is usually led by two area directors (ADs). All the area directors together form the Internet Engineering Steering Group (IESG), which has to approve all documents on the standardization track.
• The IETF standardization process distinguishes three states:
  1. Proposed Standard
  2. Draft Standard
  3. Internet Standard
  The transitions between these states usually require “rough consensus and running code.” Multiple interoperable independent implementations are required to move from Proposed Standard to Draft Standard and real-world deployment is required to move from Draft Standard to Internet Standard.
• All standards are published as so-called Requests for Comments (RFCs). Every RFC has a unique number and RFCs are never changed after publication. Different revisions of a standard thus have different RFC numbers.
There are special documents which help to locate the current RFC number for a given standard. Note: Not all RFCs are standards! There are also informational and experimental RFCs as well as RFCs which document best current practices.
• The Internet Architecture Board (IAB) is a panel which looks at longer-term architectural issues and sometimes gives advice to the IETF.
• The Internet Research Task Force (IRTF) is an organization that exists in parallel to the IETF and which looks at research questions, potentially preparing future standardization work. Similar to the IETF, the IRTF is structured into research groups. The chairs of the research groups together form the Internet Research Steering Group (IRSG).

1.2.3 IEEE Standardization

Standardization within the IEEE is organized and controlled by the IEEE-SA Standards Board. The documents created by standardization activities fall into the following categories:

• Standards are documents which define IEEE Standards.
• Recommended Practices can define procedures.
• Guides discuss alternate approaches and can provide additional background information.
• Trial-Use Documents exist only for a limited period of time.

An IEEE standardization project can produce different classes of documents:

• A new document (New) defines a standard which is not a revision of an already existing standard.
• An already existing standard can be updated and replaced by a document which is called a Revision.
• A Corrigendum is a document which makes substantial corrections in another standards document.
• An existing standard can be extended by another document which can also make substantial corrections. Such a document is called an Amendment.

The IEEE is called a sponsor and is responsible for the creation and process management of a standardization project. A project starts by submitting a Project Authorization Request (PAR). The IEEE-SA Standards Board is the board which decides whether a PAR is accepted.
PARs are evaluated by the New Standards Committee (NesCom). Technical work takes place in so-called working groups and is finalized by a voting procedure (ballot). It is generally desired to avoid negative votes by achieving consensus before the final ballot. After a successful ballot, the draft of the new standard is submitted to the IEEE-SA Standards Board for approval. The IEEE-SA Standards Board itself makes use of a Review Committee (RevCom) which helps to review the documents and to form an opinion.

Chapter 2
IEEE 802 Local Area Networks

The 802.x series of IEEE standards has been under development since the middle of the 1980s. These standards dominate the technology used in local area networks (LANs) and there is currently a trend to use the IEEE 802.x specifications also in metropolitan area networks (MANs). Some of the IEEE standards have also been approved as official ISO standards.

[Figure 2.1: Overview over the IEEE 802 standards (802 Overview and Architecture, 802.1 Management and 802.1 Bridging, 802.2 Logical Link Control, and below it the medium access and physical layer standards 802.3 (Ethernet), 802.4 (Token Bus), 802.5 (Token Ring), 802.6 (DQDB), 802.9, 802.11 (WaveLan) and 802.12)]

The currently most widely known standards are Ethernet (IEEE 802.3) and WaveLAN (IEEE 802.11). An IEEE standard for Bluetooth was approved in March 2002.

The IEEE 802.x standards cover the two lower layers of the OSI reference model. However, the IEEE 802.x standards subdivide the OSI data link layer into two sub-layers:

• The Logical Link Control (LLC) layer provides a service interface which is the same for all IEEE 802 protocols. Protocols on the network layer (e.g., the Internet Protocol) use the services provided by the LLC layer and thus work (in principle) over all IEEE 802.x protocols.
(In reality, there are sometimes differences with regard to the LLC layer service primitives supported by a given IEEE 802.x technology that can affect the mapping of network layer protocols.)
• The Medium Access Control (MAC) layer defines the method used to access the medium being used.
• The Physical (PHY) layer defines the physical properties for the various transmission media that can be used with a certain IEEE 802.x protocol.

[Figure 2.2: IEEE 802 layers in the OSI reference model (the OSI data link layer is split into Logical Link Control (LLC) and Media Access Control (MAC); IEEE 802 covers LLC, MAC and the physical layer (PHY))]

The split of the data link layer into two sub-layers has been a very important decision which enabled the IEEE to standardize very different media access technologies and protocols with a common data link interface.

2.1 Logical Link Control (IEEE 802.2)

The Logical Link Control layer is modeled after the ISO service model and provides services that are close to those offered by the HDLC protocol discussed in the second year lecture “Operating Systems and Networks”. Note that not all services are realized by all existing IEEE 802.x protocols.

2.2 Ethernet (IEEE 802.3)

The IEEE 802.3 standard is probably better known as Ethernet¹. The Ethernet technology was developed in the 1970s at XEROX PARC [13] and was later standardized with few changes by the IEEE [14]. The classic IEEE 802.3 network is a 1-persistent CSMA/CD network with a bandwidth of 1-10 Mbps.

¹ The term Ethernet is usually used synonymously for the IEEE 802.3 standards and the CSMA/CD technology in general, although this is not really correct.

year     event
1976     Original Ethernet paper [13] published
1990     10 Mbps Ethernet over twisted pair (10BaseT)
1995     100 Megabit Ethernet
1998     1 Gbps Ethernet
2002     10 Gbps Ethernet
2006*    100 Gbps Ethernet (predicted)
2008*    1 Tbps Ethernet (predicted)
2010*    10 Tbps Ethernet (predicted)

Table 2.1: Evolution of the Ethernet technology

Since the IEEE 802.3 technology was very successful, the IEEE started efforts to define extensions for 1 Gbps, 10 Gbps networks and so on. In June 2002, an IEEE standard for 10 Gbps Ethernet was approved while the standard for 100 Gbps Ethernet is under development. The evolution of the Ethernet standards is summarized in Table 2.1.

2.2.1 Physical Layer (PHY)

The physical layer of the IEEE 802.3 standard defines the transmission related properties. The following media and topologies are defined:

name       medium             max. length   max. stations   topology
10Base2    coax, ø=0.25 in    200 m         30              bus
10Base5    coax, ø=0.5 in     500 m         100             bus
10BaseT    twisted pair       100 m         1024            star
10BaseF    fiber optic        2000 m        1024            star

Table 2.2: IEEE 802.3 physical layer media and topologies

The different media have different signal propagation delays. The speed of light c is approximately c ≈ 300,000 km/s. The signal propagation speed of the various media can be expressed relative to the speed of light as shown in Table 2.3.

medium          signal propagation speed
thick coax      0.77c ≈ 231,000 km/s
thin coax       0.65c ≈ 195,000 km/s
twisted pair    0.59c ≈ 177,000 km/s
fiber optic     0.66c ≈ 198,000 km/s

Table 2.3: Signal propagation speeds for various IEEE 802.3 physical layer media

The 10Base5 medium, a rather thick copper coax wire, was also known as “yellow cable.” Stations were attached to a yellow cable by drilling a hole into the coax cable and sticking a needle into the heart of the cable. The 10Base2 medium, also sometimes called “cheaper net”, was easier to deploy since it was more flexible and stations were attached by means of so-called T-connectors.
The downside of this technology was that segments were more tightly limited in length and in the number of stations that could be supported. The fiber optic medium on the other hand supported a much larger distance, but was rather expensive to deploy.

2.2.2 Medium Access Layer (MAC)

The IEEE 802.3/Ethernet frame format is rather simple and shown in Figure 2.3.

field                                         size
preamble                                      7 bytes
start-of-frame delimiter (SFD)                1 byte
destination MAC address                       6 bytes
source MAC address                            6 bytes
length / type field                           2 bytes
data (network layer packet) and
  padding (if required)                       46-1500 bytes
frame check sequence (FCS)                    4 bytes

(destination MAC address through FCS: 64-1518 bytes)

Figure 2.3: IEEE 802.3 frame format

The various fields in the frame serve the following purposes:

• The seven byte preamble consists of bytes with the bit pattern 10101010₂. This pattern together with the Manchester Coding technique results in a periodic signal which allows the receiver to synchronize to the speed of the sender.
• The start-of-frame delimiter (SFD) has the bit pattern 10101011₂. The resulting signal change at the end of the start-of-frame delimiter, which follows the preamble, indicates the start of a frame.
• The source and destination address fields contain six byte IEEE MAC addresses.
• The two byte type/length field contains either the length of the frame (value less than 600₁₆) or the identification of the higher level protocol used by the data carried in the frame (value greater than or equal to 600₁₆). Type numbers are maintained by the IEEE and are globally unique.
• The data portion contains the actual payload, usually a packet of a network layer protocol. If necessary, the frame will be filled with padding bytes to achieve a minimal frame length.
• The end of the frame contains a four byte CRC frame checksum (CRC-32).

IEEE 802.3 uses the CSMA/CD medium access method. Figures 2.4 and 2.5 show the principal logic that is used to send and receive frames.
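To make the field layout described above more concrete, the following Python sketch splits the address and length/type fields off a frame and applies the 600₁₆ rule. It is a minimal sketch under stated assumptions: the function names format_mac and parse_header are chosen for this example, the frame is assumed to start at the destination address (preamble and SFD already removed), and the example frame is made up for illustration.

```python
import struct

TYPE_THRESHOLD = 0x600  # below: length of the data, at or above: protocol type


def format_mac(raw: bytes) -> str:
    """Render six address bytes in the colon-separated hexadecimal notation."""
    return ":".join(f"{byte:02X}" for byte in raw)


def parse_header(frame: bytes) -> dict:
    """Split destination, source and length/type field off a frame.

    Assumption: `frame` starts at the destination address, that is,
    preamble and SFD have already been removed.
    """
    if len(frame) < 14:
        raise ValueError("frame too short for an IEEE 802.3 header")
    # Network byte order: two six byte addresses and one 16 bit number.
    destination, source, value = struct.unpack("!6s6sH", frame[:14])
    kind = "length" if value < TYPE_THRESHOLD else "type"
    return {
        "destination": format_mac(destination),
        "source": format_mac(source),
        "kind": kind,
        "value": value,
        "rest": frame[14:],  # data, padding and (if present) the FCS
    }


# Hypothetical frame: broadcast destination, type 0x0800 (IPv4), 46 zero bytes.
example = bytes.fromhex("FFFFFFFFFFFF" "00D0595C038A" "0800") + bytes(46)
header = parse_header(example)
print(header["destination"], header["source"], header["kind"], hex(header["value"]))
```

The sketch does not verify the frame check sequence; it only shows how a single 16 bit field can carry either a length or a type.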
Figure 2.4: IEEE 802.3 MAC logic for sending frames. The flowchart consists of the following steps:

1. Wait for a frame to transmit and format the frame for transmission.
2. Carrier sense signal on? If yes, keep waiting. If no, wait the interframe gap time and start the transmission.
3. Collision detected? If no, complete the transmission and set the status to transmission done. If yes, transmit the jam sequence and increment the number of attempts.
4. Attempt limit reached? If yes, set the status to attempt limit exceeded. If no, compute and wait the backoff time, then continue with step 2.

The following parameters play a role for a classic 10 Mbps IEEE 802.3 network:

• The slot time of 512 bit times equals twice the maximum propagation delay plus some safety margin.
• Between two successive frames, a minimum inter-frame gap of 96 bit times is required to ensure that frame ends are properly recognized.
• The minimal length of a frame is 64 bytes; the maximum length is 1518 bytes.
• If a collision has been detected, a special jam signal is generated for the duration of 32 bit times.
• The transmission of a frame will be (re)tried up to a maximum of 16 times in case of collisions. Once a collision has been detected by the sending station, the station waits a random number R of slot times before retrying the transmission of the frame.
• On the n-th retransmission, a uniformly distributed number R is chosen from the interval [0, 2^k) with k = min(n, b) and the backoff limit b = 10.

Figure 2.5: IEEE 802.3 MAC logic for receiving frames. The flowchart consists of the following steps:

1. Incoming signal detected? If no, keep waiting. If yes, set the carrier sense signal on, obtain bit sync, wait for the SFD and receive the frame.
2. FCS and frame size OK? If no, discard the frame.
3. Destination address matches own or group address? If no, discard the frame. If yes, pass the data to the higher-layer protocol entity for processing.

There are a number of special situations which can be recognized by the MAC layer:

• Received frames with a non-integral number of bytes which fail the CRC test (alignment errors).
• Received frames with a legal length that fail the CRC test (frame check sequence (FCS) errors).
• Frames that could not be transmitted immediately since the medium was busy (deferred transmissions).
• Frames which were transmitted successfully after a single collision (single collision frames).
• Frames which were transmitted successfully after multiple collisions (multiple collision frames).
• Frames which could not be transmitted due to continued collisions (excessive collisions).
• Collisions detected more than one slot time after the start of the transmission (late collisions). Collisions that happen after the slot time are typically indications of wires that exceed the maximum allowed length.
• Otherwise unspecified internal MAC errors during the transmission of a frame (internal MAC transmit errors).
• Otherwise unspecified internal MAC errors during the receipt of a frame (internal MAC receive errors).
• Failure to sense the carrier signal during a transmission (carrier sense errors).
• Frames that exceed the maximum allowed frame length (frame too long errors).

2.2.3 Fast-Ethernet (IEEE 802.3u)

The classic 10 Mbps IEEE 802.3 standard allows a maximum wire length (including repeaters, which basically amplify the signal) of 2.5 km. This results in a maximum round-trip propagation delay (including some delay in the repeaters) of about 50 µs. At 10 Mbps, about 500 bits are sent during this time, which leads, with some safety margin, to a minimum frame length of 512 bits.

The objective of the development of the Fast-Ethernet standard was a data rate of 100 Mbps without changes to the medium access mechanism. With an unchanged minimum frame length, a frame is transmitted ten times faster, so the maximum length of the wire has to be reduced in order to keep collisions detectable. Accordingly, the Fast-Ethernet wire length is limited to 100 m. This relatively short length was acceptable since the developers envisioned the transition to star topologies with twisted pair cables.

Fast-Ethernet can be used with twisted pair and fiber optic cables. The support of UTP Category 3 and Category 5 cables leads to some peculiarities in the physical layer. The general advice, however, is to use Category 5 cables (or higher).
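The relationship between bit rate, round-trip delay and minimum frame length used in this section can be checked with a few lines of Python. This is only a sketch under stated assumptions: the safety margin is modelled by rounding up to the next power of two (which happens to give 512 bits), times are handled as integer nanoseconds to avoid rounding surprises, and all function names are made up for illustration.

```python
NS_PER_S = 10**9


def bits_during(bit_rate_bps: int, interval_ns: int) -> int:
    """Number of bits sent during interval_ns at the given rate, rounded up."""
    return (bit_rate_bps * interval_ns + NS_PER_S - 1) // NS_PER_S


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is greater than or equal to n (n >= 1)."""
    return 1 << (n - 1).bit_length()


def round_trip_budget_ns(bit_rate_bps: int, min_frame_bits: int) -> int:
    """Time it takes to send a minimum-size frame, in nanoseconds."""
    return min_frame_bits * NS_PER_S // bit_rate_bps


# Classic Ethernet: 10 Mbps and a round trip of about 50 microseconds.
raw_bits = bits_during(10_000_000, 50_000)
min_frame_bits = next_power_of_two(raw_bits)

# Round-trip time that a 512-bit frame can cover at 10 Mbps and at 100 Mbps.
classic_budget = round_trip_budget_ns(10_000_000, min_frame_bits)
fast_budget = round_trip_budget_ns(100_000_000, min_frame_bits)

print(raw_bits, min_frame_bits)      # 500 512
print(classic_budget, fast_budget)   # 51200 5120
```

The budget shrinks by a factor of ten (from 51.2 µs to 5.12 µs) when the bit rate is increased to 100 Mbps and the minimum frame length is kept, which is the reason for the much shorter maximum wire length of Fast-Ethernet.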
name        medium         max. length
100BaseT4   twisted pair   100 m
100BaseTX   twisted pair   100 m
100BaseFX   fiber optic    412 m

Table 2.4: IEEE 802.3u physical layer media and topologies

The 100BaseT4 medium uses four twisted pairs (of Category 3 cable) while 100BaseTX uses two twisted pairs (of Category 5 cable).

2.2.4 Gigabit Ethernet (IEEE 802.3z/802.3ab)

The Gigabit Ethernet standard specified in IEEE 802.3z initially supported fiber optic media. Support for Category 5 UTP cables was later added by the IEEE 802.3ab specification.

Gigabit Ethernet can operate in half-duplex and full-duplex mode. In half-duplex mode, the protocol still uses the CSMA/CD method. To make the use of CSMA/CD possible, the slot time has been changed from 64 bytes to 512 bytes, which means that frames smaller than 512 bytes are augmented with a new carrier extension field following the CRC field. When operating in full-duplex mode, the original IEEE 802.3 slot time is used and frames are not augmented. New installations usually use Gigabit Ethernet in full-duplex mode, where frames can be sent and received simultaneously and where almost all of the theoretically available bandwidth can be used to transmit data.

name         medium         max. length
1000BaseLX   fiber optic    500 / 550 / 5000 m
1000BaseSX   fiber optic    220-275 / 550 m
1000BaseCX   coax           25 m
1000BaseT    twisted pair   100 m

Table 2.5: IEEE 802.3z/802.3ab physical layer media and topologies

2.2.5 10 Gigabit Ethernet (IEEE 802.3ae)

The 10 Gigabit Ethernet specification IEEE 802.3ae is a full-duplex and fiber-only technology and thus does not need the CSMA/CD medium access method anymore. There are two different physical layers specified: The LAN PHY layer is for local area networks while the WAN PHY layer has an extended feature set compared to the LAN PHY layer.

2.3 Wireless LANs (IEEE 802.11)

The Wireless LAN (WaveLAN) standard specified in IEEE 802.11 is rather different from the IEEE 802.3 standards.
It uses a collision avoidance scheme (CSMA/CA) derived from the MACA medium access method, where small RTS/CTS frames can be exchanged before the data is actually transmitted.

Wireless LANs support two modes of operation. In the ad-hoc mode, stations are brought together to form a network on the fly. An election algorithm is used to elect one station which serves as the master while the other stations become slaves. The second mode assumes the presence of some fixed network access points (also sometimes called base stations) with which mobile stations can communicate.

2.4 Bluetooth LANs (IEEE 802.15)

The Bluetooth standard specified in IEEE 802.15 provides a wireless network technology for rather small cells and is typically used to create wireless personal area networks. Typical Bluetooth devices are PDAs or wireless headsets which can communicate with a PC or laptop. Due to the relatively small area covered by IEEE 802.15, it is possible to save quite some energy compared to the IEEE 802.11 family of standards.

2.5 Port Access Control (IEEE 802.1X)

Port-based network access control as defined in IEEE 802.1X makes use of the physical access characteristics of IEEE 802 LAN infrastructures in order to provide a means of authenticating and authorizing devices attached to a LAN port that has point-to-point connection characteristics, and of preventing access to that port in cases in which the authentication and authorization process fails. A port in this context is a single point of attachment to the LAN infrastructure.

Examples of ports in which the use of authentication can be desirable include the ports of MAC bridges (as specified in IEEE 802.1D), the ports used to attach servers or routers to the LAN infrastructure, and associations between stations and access points in IEEE 802.11 wireless LANs.

2.6 Bridges

Multiple IEEE 802 LAN segments can be interconnected by using so-called bridges.
By using bridges, it does not really matter which IEEE 802 technology is used in the segments that are to be connected. Examples are big Ethernet LANs that consist of multiple Ethernet segments and also include Wireless LAN segments.

[Figure 2.6: Bridges are used to interconnect different LAN segments. The figure shows the bridges B1, B2 and B3 interconnecting 10Base5, 10Base2, 100BaseT, 802.5 and 802.11 segments.]

Bridges (sometimes also called layer two switches) have a number of advantages:

1. Different IEEE 802 LAN technologies (e.g., Ethernet, Token Ring, WLAN) can be interconnected.
2. Geographically dispersed LAN segments can be connected by using different media in the backbone segments (e.g., fiber) and the access segments (e.g., twisted pair).
3. Highly loaded LAN segments can be split into smaller segments, which improves their performance.
4. Bridges can improve the robustness of the network since errors are better localized (due to smaller segments) and since bridges offer the possibility to have multiple redundant paths in the network.
5. Bridges can improve the security of the network to some extent since traffic can be better restricted to the shorter local LAN segments.

Bridges operate on the IEEE 802 LLC layer as shown in Figure 2.7, and this is the reason why different IEEE 802 technologies can be crossed via a bridge.

   Network                                                     Network
                                  Bridge
   IEEE 802.2 LLC             IEEE 802.2 LLC                IEEE 802.2 LLC
   IEEE 802.3 MAC    IEEE 802.3 MAC | IEEE 802.11 MAC       IEEE 802.11 MAC
   IEEE 802.3 PHY    IEEE 802.3 PHY | IEEE 802.11 PHY       IEEE 802.11 PHY

Figure 2.7: IEEE 802 bridge connecting an IEEE 802.3 and an IEEE 802.11 segment

Although bridges are conceptually simple, there are some issues one has to pay attention to:

• Different LAN segments usually operate at different speeds in terms of bits per second. A bridge connecting such segments should have some buffering capacity to handle traffic bursts or peaks (but of course, every buffer has a limited size).
• Different LAN segments may have different maximum frame sizes. A bridge receiving a frame which exceeds the maximum frame size of the destination LAN segment can only drop that frame.
• Different LAN segments which operate at different speeds may confuse timers at higher protocol levels that are not aware of the bridging situation.
• Some LAN technologies support priorities while others do not.
• Some LAN technologies are real-time capable while others are not.
• Some LAN technologies signal the delivery of a frame to the sender while others do not.

There are two basic types of bridges: source routing bridges and transparent bridges. Both of them are discussed in the next sections.

2.6.1 Source Routing Bridges

Source routing bridges assume that a sending station can distinguish between stations attached to the local LAN segment and stations that are attached to remote LAN segments. If a frame has to be sent to a station connected to a remote LAN segment, the sender first has to determine the path to the remote LAN segment before sending the frame along this path. The path to follow is actually encoded and sent along with the frame. A special protocol is used by the stations to locate destination stations and to find suitable routes.

The advantage of this approach is that one can make efficient use of the available bandwidth by utilizing redundant paths to the receiving station. The price is, however, increased complexity in the end systems that participate in a source routing bridged network.

2.6.2 Transparent Bridges (IEEE 802.1D)

Transparent bridges (sometimes also called spanning tree bridges) do not need special software on the stations, nor do they need manual configuration. Instead, they adapt to their environment automatically and are thus fully transparent from the view of the network user or (to some extent) the network operator.
The price for this is that not all available bandwidth in a bridged network can be used to its full potential.

LAN segments are connected to transparent bridges through so-called ports. The simplest of all transparent bridges has two ports. Today, it is not unusual to have bridges with hundreds of ports that are realized on multiple modules interconnected by a high-speed backplane network. Many of the commercial products can be stacked so that a bridge can grow in the number of ports and in the number of IEEE 802 technologies supported on the ports.

[Figure 2.8: Internal structure of transparent bridges. The figure shows the forwarding database (station address, port number), the bridge protocol entity, the port management software, the memory buffers, and one MAC chipset for each of the ports 1 and 2.]

Bridges can receive frames on multiple ports simultaneously. It is therefore necessary to have some buffer space to hold incoming frames. The ports of a transparent bridge generally work in promiscuous mode, which allows them to receive all frames on the segment and not only the frames that are destined to the bridge.

A transparent bridge internally maintains a forwarding database which maps destination MAC addresses to outgoing port numbers.

• When a frame has been received by a transparent bridge, the forwarding database is checked to find an entry which matches the destination address contained in the received frame.
• If a matching entry has been found and if the port number associated with the MAC address is not equal to the port number from which the frame was received, then the frame is forwarded to the port indicated by the forwarding database entry. The frame is discarded if the port number of the forwarding database entry is identical to the port number from which the frame was received.
• If no matching entry can be found in the forwarding database, then the frame is forwarded to all ports except the port from which the frame was received (flooding).
• Many bridges also support a feature which allows network operators to disable the forwarding function for certain MAC addresses.

Backward Learning

The forwarding database must be populated and it must adapt to changes of the network topology dynamically. One usually solves this problem by learning the current configuration from the frames received by the bridge. Learned entries in the forwarding database have a timer attached which expires these entries in case no further matching frames have been received:

• The forwarding database is initialized to be empty when a bridge boots or is reinitialized.
• When a bridge receives a frame whose source address does not yet exist in the forwarding database, it extracts the source address and determines the port number from which the frame was received. The source address and the port number are then stored in the forwarding database. The frame is then forwarded to all other ports (which also propagates information to other bridges).
• Every entry in the forwarding database has a timer attached to it. Entries are automatically discarded if they have not been confirmed by additional received frames within a certain time interval (soft state).
• The aging of unused entries reduces the size of the forwarding table and allows bridges to react to topology changes dynamically (after a short delay).
• The backward learning algorithm only works if the topology is a strict tree and does not contain cycles. In case of multiple paths between LAN segments, it is possible that entries in the forwarding database are overwritten periodically. The behavior of the network is not stable in such a situation.

Spanning Trees

Bridged networks which do not have a loop-free tree structure cause problems since frames might travel endlessly in a loop (ping-pong) when using backward learning alone. Transparent bridges therefore construct a spanning tree in these cases which is used to restrict how frames are forwarded.
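The forwarding rules and the backward learning algorithm described above can be summarized in a short Python sketch. It is a simplified model, not an implementation of IEEE 802.1D: the class and method names are made up, time is passed in explicitly as a number of seconds, the aging time of 300 seconds is an assumed value, and every received frame refreshes the entry for its source address.

```python
class LearningBridge:
    """Toy model of a transparent bridge with backward learning."""

    def __init__(self, ports, max_age=300.0):
        self.ports = set(ports)
        self.max_age = max_age  # assumed aging time in seconds
        self.fdb = {}           # MAC address -> (port, time last seen)

    def handle_frame(self, src, dst, in_port, now):
        """Return the list of ports on which the frame is sent out."""
        # Soft state: drop entries that have not been confirmed recently.
        self.fdb = {mac: (port, seen)
                    for mac, (port, seen) in self.fdb.items()
                    if now - seen <= self.max_age}
        # Backward learning: the source is reachable via the incoming port.
        self.fdb[src] = (in_port, now)
        entry = self.fdb.get(dst)
        if entry is None:
            # Unknown destination: flood to all ports except the incoming one.
            return sorted(self.ports - {in_port})
        out_port, _ = entry
        if out_port == in_port:
            # Destination is on the segment the frame came from: discard.
            return []
        return [out_port]


bridge = LearningBridge(ports=[1, 2, 3])
print(bridge.handle_frame("A", "B", in_port=1, now=0.0))    # [2, 3]
print(bridge.handle_frame("B", "A", in_port=2, now=1.0))    # [1]
print(bridge.handle_frame("A", "B", in_port=1, now=2.0))    # [2]
print(bridge.handle_frame("C", "A", in_port=1, now=3.0))    # []
print(bridge.handle_frame("A", "B", in_port=1, now=400.0))  # [2, 3]
```

The last call shows the effect of aging: the entry for station B has expired in the meantime, so the frame is flooded again.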
The spanning tree protocol requires a unique identification of the bridges involved. The so-called bridge identifier consists of one of the MAC addresses (six bytes) of a bridge plus a priority value (two bytes). The priority value can be set administratively to influence the spanning trees computed with the spanning tree protocol. The spanning tree protocol executes in the following steps:

1. In the first step, the root of the spanning tree is selected (root bridge). The root bridge is the bridge with the highest priority (i.e., the numerically lowest priority value) and, among bridges of equal priority, the smallest bridge address. The root of the spanning tree is announced periodically and will be recomputed as needed.
2. In the second step, the costs for all possible paths from the root bridge to the various ports on the bridges are computed (root path cost). Every bridge determines which local port is used to reach the root bridge at the lowest costs. The selected port is called the root port.
3. In the third step, the designated bridge is determined for each segment. The designated bridge of a segment is the bridge which connects the segment to the root bridge with the lowest costs on its root port. At equal costs, the bridge with the lowest bridge identifier wins. The ports through which the segments reach their designated bridges are called designated ports.
4. Finally, all ports are blocked which are neither root ports nor designated ports. The resulting active topology is a spanning tree.

The spanning tree protocol uses so-called BPDUs to distribute information. A BPDU has the structure shown in Figure 2.9.
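The first step of the algorithm, the selection of the root bridge, can be sketched in Python as follows. The sketch assumes that the bridge identifier is compared as an eight-byte number with the priority value in the two most significant bytes, so that the numerically lowest identifier wins. The bridge names, priorities and MAC addresses are made up for illustration, and the exchange of BPDUs is not modelled at all.

```python
def bridge_id(priority: int, mac: str) -> bytes:
    """Build the 8-byte bridge identifier: 2-byte priority, 6-byte MAC."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    if len(mac_bytes) != 6 or not 0 <= priority <= 0xFFFF:
        raise ValueError("invalid priority or MAC address")
    return priority.to_bytes(2, "big") + mac_bytes


def elect_root(bridges: dict) -> str:
    """Return the name of the bridge with the lowest bridge identifier."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


# Hypothetical bridges: name -> (priority value, MAC address)
bridges = {
    "B1": (32768, "00:00:00:00:00:03"),
    "B2": (32768, "00:00:00:00:00:01"),
    "B3": (4096, "00:00:00:00:00:09"),
}
print(elect_root(bridges))  # B3: lowest priority value wins

del bridges["B3"]
print(elect_root(bridges))  # B2: equal priority, smallest address wins
```

Comparing the identifiers as byte strings of equal length gives the same order as comparing them as unsigned integers, which keeps the sketch short.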
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Protocol Identifier      |    Version    |   BPDU Type   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Flags     |                                               =
+-+-+-+-+-+-+-+-+                    Root ID                    +
=                                                               =
+               +-----------------------------------------------+
=               |                Root Path Costs                =
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
=               |                                               =
+-+-+-+-+-+-+-+-+                   Bridge ID                   +
=                                                               =
+               +-----------------------------------------------+
=               |            Port ID            |    Message    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Age      |          Maximum Age          |     Hello     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Timer     |         Forward Delay         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 2.9: Bridge PDU (BPDU) format

2.7 Virtual LANs (IEEE 802.1Q)

Virtual LANs (virtual bridged LANs, VLANs) emulate a virtual LAN segment on top of a complex IEEE 802 bridged network.

[Figure 2.10: Virtual LANs. The figure shows a bridged network with the bridges B0, B1 and B2.]

VLANs make it possible to separate the traffic on an IEEE 802 network, which has several advantages:

• A station connected to a certain VLAN only sees frames that belong to the VLAN.
• VLANs can reduce the network load. In particular, frames that are targeted to all stations (broadcasts) will only be delivered to the stations connected to the VLAN.
• It is possible that a station is a member of multiple VLANs simultaneously. This makes it possible, for example, to use a central server from multiple VLANs.
• By assigning stations to VLANs, it is possible to create logical LAN topologies that are independent of the underlying physical LAN topology.

A VLAN is identified by a VLAN identifier (1..4094) and realized by VLAN supporting bridges. The assignment of bridge ports to VLANs can be done in different ways:

• Port based VLANs: The ports of a bridge are assigned administratively to the various VLANs.
A single port can in general participate in multiple VLANs.
• MAC address based VLANs: The MAC addresses of the stations are assigned administratively to the various VLANs. With this scheme, it does not matter on which port a given station connects to a bridge.
• Protocol based VLANs: Frames are assigned to VLANs by inspecting the payload contained in the frames. This technique makes it possible to create VLANs for, e.g., AppleTalk or IPX frames.
• Multicast group based VLANs: VLANs are defined for all members of a certain multicast group. This requires a multicast group membership protocol to be effective.

On links that carry frames which belong to different VLANs, it is necessary to tag the frames with the VLAN identifier. In the case of Ethernet frames, a new field called the tag header is introduced right after the destination and source addresses, as shown in Figure 2.11.

field                                      size
preamble                                   7 bytes
start-of-frame delimiter (SFD)             1 byte
destination MAC address                    6 bytes
source MAC address                         6 bytes
tag protocol identifier                    2 bytes
priority (3 bits), CFI (1 bit),
  VLAN identifier (12 bits)                2 bytes
length / type field                        2 bytes
data (network layer packet) and
  padding (if required)                    46-1500 bytes
frame check sequence (FCS)                 4 bytes

The fields from the destination MAC address through the FCS together form a frame of 64-1522 bytes.

Figure 2.11: IEEE 802.3 tagged frame format

The introduction of VLAN tags has some implications:

• Tagged frames can exceed the maximum frame length accepted by stations which do not support VLANs.
• The IEEE 802.1Q standard generally requires that frames which exceed the maximum allowed length are discarded.
• In the case of IEEE 802.3 frames, an extension of the original frame by four bytes has been granted (which changes the maximal length of a frame from 1518 bytes to 1522 bytes).

The IEEE 802.1Q standard also introduces the Generic Attribute Registration Protocol (GARP), which can among other things propagate information about VLAN membership of individual ports.
This information can be used by VLAN enabled devices to suppress frames for VLANs which currently have no members.

2.8 LAN Priorities (IEEE 802.1D)

The 1998 revision of the IEEE 802.1D standard introduces additional support for priorities and quality of service. (The additions were developed under the name IEEE 802.1p and are still referred to by this name.) The original IEEE 802.3 and IEEE 802.11 frame formats do not allow priorities to be communicated. When using IEEE 802.1Q VLAN frames, priorities can be encoded in the 3-bit priority field of the four-byte VLAN tag.

The core idea behind the 802.1D priority extensions is to support bridges that have multiple output queues for each port. Within a bridge, frames are assigned to certain traffic classes based on the user priority (usually carried with the frame) and the access priority (which is the priority associated with the media access mechanism). The traffic class of a frame is then used to select the queue where the frame is queued for transmission. Note that a bridge must preserve the ordering of unicast frames with a given combination of source and destination addresses and the order of multicast frames with a given destination address.

Chapter 3 Internet Network Layer

The Internet protocols developed and standardized by the IETF currently dominate the network layer in data communication networks and they reach out into voice and multi-media communication networks. The widely deployed version of the Internet Protocol (IP) is version 4 (IPv4) while version 6 (IPv6) is currently gaining more deployment and thus practical relevance. This chapter is centered around these two protocols. It also discusses the protocols which support the network layer such as routing protocols or protocols which aim to automate the end system configuration process.
3.1 Fundamentals

First, we consider some fundamentals which are important to understand the design of the Internet protocols.

3.1.1 Evolution of the Internet

In the mid 1970s, the Defense Advanced Research Projects Agency (DARPA) of the USA started projects to develop inter-networking technologies. These projects led to the ARPANET, a packet switched network running on top of leased lines. The ARPANET later became a backbone network connecting all the major universities in the USA.

An implementation of the Internet protocols became a part of the BSD Unix operating system in the early 1980s. The BSD Unix system became very popular in research organizations, which led to the wide deployment of the Internet protocols in research environments. The integration of the Internet protocols into BSD Unix also led to the development of the so-called socket application programming interface (API), which became the de facto standard operating system-level API to write networked applications.

In 1983, the ARPANET was split into the ARPANET research network and the MILNET for use by the US military. The ARPANET research network became the NSFNET in 1986, which was then funded by the National Science Foundation of the USA. In 1990, the NSFNET backbone turned into the ANSNET operated jointly by MERIT, MCI and IBM. In the early 1990s, the World Wide Web was born at CERN in Switzerland.

More details about the evolution of the Internet can be found on the Web page of the Internet Society (http://www.isoc.org/).

3.1.2 Internet Design Principles

There are a number of fundamental principles which were followed during the development of Internet protocols. Some of the principles described in RFC 1958 [15] are:

• The first principle is that connectivity is its own reward. The idea here is that connectivity across different link and transmission technologies is more valuable than any individual application such as email or the World-Wide Web.
The technique used to achieve connectivity is an inter-networking layer which imposes only very basic requirements on the underlying link and transmission technologies.
• All functions which require knowledge of the state of end-to-end communication should be realized at the endpoints and not inside of the network (end-to-end argument). In other words, end-to-end protocol design should not rely on the maintenance of state (i.e., information about the state of the end-to-end communication) inside the network. Such state should be maintained only in the endpoints, in such a way that the state can only be destroyed when the endpoint itself breaks (known as fate-sharing). Of course, to perform its services, the network maintains some state information such as routes and the like. This state must be self-healing; adaptive procedures or protocols must exist to derive and maintain that state, and change it when the topology or activity of the network changes. The volume of this state must be minimized, and the loss of the state must not result in more than a temporary denial of service given that connectivity exists. Manually configured state must be kept to an absolute minimum.
• There is no central instance which controls the Internet and which is able to turn it off.
• Addresses should uniquely identify endpoints. Dynamic changes within the network should be possible without having to change the identification of the end-systems.
• Intermediate systems should be stateless wherever possible. If state is necessary, it should be attached to a timer and not static (soft state).
• To increase interoperability, implementations should be liberal in what they accept and stringent in what they generate. Interoperability is more important than strict correctness.
• Keep it simple. When in doubt during design, choose the simplest solution.

It is also important to consider that protocols sometimes show effects when used on a larger scale that cannot be observed on small scales.
This often comes from interactions between layers or features. One approach to address these issues is to keep complexity down to a minimum by following the simplicity principle discussed further in RFC 3439 [16].

3.1.3 Basic Terminology

It is necessary to introduce some terminology. These lecture notes use the terminology as defined in RFC 2460 [17]. Some older books and documents do not necessarily use the same terminology and it is thus sometimes necessary to mentally map terms when reading other documents.

• A node is a device which implements an Internet Protocol (such as IPv4 or IPv6).
• A router is a node that forwards IP packets not addressed to itself.
• A host is any node which is not a router.
• A link is a communication channel below the IP layer which allows nodes to communicate with each other (e.g., an Ethernet).
• The neighbors of a node are the set of all nodes attached to the same link.
• An interface is a node’s attachment to a link.
• An IP address identifies an interface or a set of interfaces.
• An IP packet is a bit sequence consisting of an IP header and the payload.
• The link MTU is the maximum transmission unit, i.e., the maximum packet size in octets, that can be conveyed over a link.
• The path MTU is the minimum link MTU of all the links in a path between a source node and a destination node.

3.1.4 Autonomous Systems

The global Internet consists of a set of so-called autonomous systems which are inter-connected. An autonomous system (AS) is basically a set of routers and networks under the same administration.

• An autonomous system is identified by a number, the so-called AS number. The number space is currently restricted to 16 bits, which is becoming problematic.
• IP packets are forwarded between autonomous systems over paths that are established by an Exterior Gateway Protocol. The internal structure of an autonomous system is irrelevant for the protocol establishing paths between autonomous systems.
• Within an autonomous system, IP packets are forwarded over paths that are established by an Interior Gateway Protocol.

The introduction of autonomous systems and the distinction between interior and exterior routing protocols implies a two-level Internet routing architecture. Autonomous systems can be classified as follows:

• A stub AS only has a single connection to another AS.
• A multihomed AS has multiple connections to other ASes but does not forward transit traffic.
• A transit AS has multiple connections to other ASes and carries local as well as transit traffic.

3.1.5 Internet Address Scopes

Internet addresses do not all have the same scope of uniqueness. While most IP addresses have global scope, some addresses are only guaranteed to be unique on a certain interface while others are only guaranteed to be unique on a certain link.

The scope of an Internet address is a topological span within which the address may be used as a unique identifier for an interface or a set of interfaces. A scope zone, or simply a zone, is a concrete connected region of topology of a given scope. Note that a zone is a particular instance of a topological region, whereas a scope is the size of a topological region.

 ---------------------------------------------------------------
| a node                                                        |
|                                                               |
|                                                               |
| /--link1--\ /--------link2--------\ /--link3--\ /--link4--\   |
|                                                               |
| /--intf1--\ /--intf2--\ /--intf3--\ /--intf4--\ /--intf5--\   |
 ---------------------------------------------------------------
       :           |           |           |           |
       :           |           |           |           |
       :           |           |           |           |
  (imaginary     =================     a point-        a
   loopback         an Ethernet        to-point     tunnel
     link)                               link

Since Internet addresses on devices that connect multiple zones are not necessarily unique, an additional zone index is needed on these devices to select an interface or a set of interfaces.

3.2 Internet Protocol Version 4 (IPv4)

The Internet Protocol version 4 (IPv4) was standardized in 1981 and is documented in RFC 791 [18].
The IPv4 protocol is the basis of today’s global Internet. The original IPv4 specification has been adapted to emerging requirements during the last 20 years. The following description of IPv4 describes the current interpretation of IPv4.

3.2.1 IPv4 Addresses

The principal structure and the textual representation of IPv4 addresses have already been introduced in chapter 1.

• For forwarding purposes, IPv4 addresses are divided into a part which identifies a network (netid) and a part which identifies an interface of a node within that network (hostid).
• The number of leading bits of an IPv4 address which identify the network is called the address prefix length. The prefix length is commonly written as a decimal number, appended to the usual IPv4 address notation by using a slash (/) as a separator (e.g., 192.0.2.0/24).
• Older documents use a so-called netmask, which is a bit field of the size of an IPv4 address. The network identifier is obtained by combining the netmask with an IPv4 address in a logical bitwise AND operation (e.g., 192.0.2.0 & 255.255.255.0).

Not all possible IPv4 addresses can be used in the global Internet without restrictions.
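The relationship between the prefix notation and the older netmask notation can be made concrete with a few lines of Python. The sketch uses the ipaddress module of the Python standard library only to convert between the dotted notation and 32-bit integers; the function names and the host address 192.0.2.77 are made up for illustration.

```python
import ipaddress


def netmask_from_prefix(prefix_len: int) -> int:
    """Return the 32-bit netmask with prefix_len leading one bits."""
    if not 0 <= prefix_len <= 32:
        raise ValueError("prefix length must be in the range 0..32")
    return (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF


def network_id(address: str, prefix_len: int) -> str:
    """Apply the bitwise AND of address and netmask, as described above."""
    addr = int(ipaddress.IPv4Address(address))
    return str(ipaddress.IPv4Address(addr & netmask_from_prefix(prefix_len)))


mask = str(ipaddress.IPv4Address(netmask_from_prefix(24)))
print(mask)                          # 255.255.255.0
print(network_id("192.0.2.77", 24))  # 192.0.2.0
print(network_id("192.0.2.77", 28))  # 192.0.2.64
```

The same result can be obtained directly from the standard library with ipaddress.ip_network("192.0.2.77/24", strict=False), which is the more practical choice outside of a teaching example.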
Some addresses are reserved or have special semantics attached to them, as described in RFC 3330 [19]:

Address Block       Present Use                                       Reference
0.0.0.0/8           "This" Network                                    [RFC1700]
10.0.0.0/8          Private-Use Networks                              [RFC1918]
14.0.0.0/8          Public-Data Networks                              [RFC1700]
24.0.0.0/8          Cable Television Networks                         [RFC3330]
39.0.0.0/8          Class A Subnet Experiment                         [RFC1797]
127.0.0.0/8         Loopback                                          [RFC1700]
128.0.0.0/16        Reserved by IANA                                  [RFC3330]
169.254.0.0/16      Link Local                                        [RFC3330]
172.16.0.0/12       Private-Use Networks                              [RFC1918]
191.255.0.0/16      Reserved by IANA                                  [RFC3330]
192.0.0.0/24        Reserved by IANA                                  [RFC3330]
192.0.2.0/24        Test-Net / Documentation                          [RFC3330]
192.88.99.0/24      6to4 Relay Anycast                                [RFC3068]
192.168.0.0/16      Private-Use Networks                              [RFC1918]
198.18.0.0/15       Network Interconnect / Device Benchmark Testing   [RFC2544]
223.255.255.0/24    Reserved by IANA                                  [RFC3330]
224.0.0.0/4         Multicast                                         [RFC3171]
240.0.0.0/4         Reserved for Future Use                           [RFC1700]

• Addresses for private networks, which are not routed through the public global Internet, can be taken from the address blocks 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16 as specified in RFC 1918 [20].
• Test addresses or addresses that are used solely for documentation purposes can be taken from the address block 192.0.2.0/24.
• Addresses from the address block 0.0.0.0/8 identify a sender which is not yet fully configured (typically 0.0.0.0).
• The address block 127.0.0.0/8 identifies the local node and is also called the loopback network.
• The special address 255.255.255.255 causes a local broadcast.

3.2.2 IPv4 Packet Format

IPv4 packets have the following structure as specified in RFC 791 [18]:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live |    Protocol   |         Header Checksum       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Source Address                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Destination Address                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Remarks:

• The Version field contains the version number (4).
• The length of the protocol header is stored in the Internet Header Length (IHL) field. The length is counted in units of 4-byte words. The minimum header length is 5 (which corresponds to 20 bytes) and the maximum header length is 15 (which corresponds to 60 bytes).
• The interpretation of the Type of Service (TOS) field has changed over time. The current interpretation of this field uses six bits as the Differentiated Services Code Point (DSCP) and two bits for explicit congestion notification (ECN), as specified in RFC 2474 [21], RFC 3168 [22] and RFC 3260 [23].

   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+-----+-----+-----+-----+
|          DS FIELD, DSCP           | ECN FIELD |
+-----+-----+-----+-----+-----+-----+-----+-----+

• The length of the IPv4 packet (including the protocol header) is stored in the Total Length field. Since this is a 16-bit field, IPv4 packets can have a maximum length of 65535 bytes.
• The fields Identification, Flags and Fragment Offset support fragmentation of IPv4 packets. The Identification field contains the same value for all fragments of an IPv4 packet. The Fragment Offset field contains the relative position of a fragment within an IPv4 packet (counted in 64-bit words). The flag More Fragments (MF) is set if more fragments follow.
The flag Don’t Fragment (DF) can be set to indicate that the sender does not want fragmentation, in which case an IPv4 packet will be discarded if it does not fit into the maximum frame size of the outgoing link of a node. Note that IPv4 allows fragments to be further fragmented without intermediate reassembly.
• The Time to Live (TTL) field is used to limit the lifetime of an IPv4 packet. The lifetime is usually measured in the number of hops passed rather than as a period of time. Every router forwarding an IPv4 packet decrements this field and the packet is discarded once the value of the field becomes zero.
• The Protocol field identifies the protocol contained in the IPv4 packet, in most cases one of the Internet transport protocols.
• The Header Checksum field contains the Internet checksum computed over the header.
• The Source Address and Destination Address fields contain the source and destination address of the packet.
• There are a number of options which can be used to control the forwarding of a packet or which cause routers to append forwarding information to the protocol header. Most of these options are practically irrelevant since the remaining 40 bytes are usually not enough for using these options.

3.2.3 IPv4 Forwarding

Every node maintains a forwarding table (also sometimes called the forwarding information base) which is used to direct IPv4 packets closer to their destination [24].

• The forwarding table realizes a mapping of the network prefix to the next node (next hop) and the local interface used to reach the next node.
• For every IP packet, the forwarding table entry with the longest matching network address prefix has to be found (longest prefix match).
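The longest prefix match rule can be illustrated with a short Python sketch. It is a minimal illustration only: the function name longest_prefix_match, the table layout and the table contents (taken from documentation address blocks) are assumptions made for this example, and the linear scan is chosen for clarity rather than speed.

```python
import ipaddress


def longest_prefix_match(table, destination):
    """Return the (network, next_hop, interface) entry whose prefix contains
    the destination and is the longest such prefix, or None if no entry matches."""
    dst = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop, interface in table:
        net = ipaddress.ip_network(prefix)
        # An entry matches if the destination lies inside its prefix;
        # among all matching entries the one with the longest prefix wins.
        if dst in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop, interface)
    return best


# Hypothetical forwarding table (illustrative values, not from the lecture).
TABLE = [
    ("0.0.0.0/0", "192.0.2.1", "eth0"),            # default route
    ("198.51.100.0/24", "192.0.2.2", "eth0"),
    ("198.51.100.128/25", "203.0.113.1", "eth1"),  # more specific exception
]

if __name__ == "__main__":
    for address in ("198.51.100.200", "198.51.100.5", "8.8.8.8"):
        net, next_hop, interface = longest_prefix_match(TABLE, address)
        print(address, "->", net, "via", next_hop, "on", interface)
```

The default route 0.0.0.0/0 matches every destination but has prefix length zero, so it is only selected if no more specific entry matches.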
The following example shows, for a simple network topology, the contents of the forwarding tables involved. The network 134.169.0.0/16 connects H1 (134.169.9.10), R1 (134.169.2.1) and R2 (134.169.246.34); the network 134.169.34.0/24 connects R2 (134.169.34.12) and H2 (134.169.34.1).

   Forwarding table of H1 (134.169.9.10):

   Prefix              Next Hop          Interface
   0.0.0.0/0           134.169.2.1       eth0
   134.169.34.0/24     134.169.246.34    eth0
   134.169.0.0/16      134.169.9.10      eth0

   Forwarding table of R2 (134.169.246.34, 134.169.34.12):

   Prefix              Next Hop          Interface
   0.0.0.0/0           134.169.2.1       eth0
   134.169.0.0/16      134.169.246.34    eth0
   134.169.34.0/24     134.169.34.12     eth1

   Forwarding table of H2 (134.169.34.1):

   Prefix              Next Hop          Interface
   0.0.0.0/0           134.169.34.12     eth0
   134.169.34.0/24     134.169.34.1      eth0

   Figure 3.1: Example for IPv4 forwarding

Variations and extensions of the basic forwarding model:

• A node can have multiple forwarding tables. Information contained in fields of the incoming IP packet (e.g., the DSCP value) is used to select one out of many forwarding tables to forward the packet.
• Instead of maintaining the forwarding table(s) in a central place of a router, it is possible to store at least frequently used parts on the router interfaces. This approach allows forwarding lookups to be parallelized at the cost of more complex updates if routing tables change.
• Another approach to increase performance is to use caches for frequently used destination addresses.

The performance of IP address lookups is crucial for high-speed IP routers:

• Forwarding tables can become very large (around 100000 entries on backbone routers have been reported in January 2001).
• One technique used to reduce the size of forwarding tables is called address aggregation. If a router has multiple forwarding table entries with a common prefix which point to the same interface, the router can aggregate these entries into a single entry with a shorter prefix length. Exceptions can still be handled by having some entries with a longer prefix.
• Due to the growth in the number of packets per second a router has to handle and the growth of the forwarding tables, it is crucial to design lookup algorithms that scale well with the number of addresses stored in a forwarding table. Note that routing updates occur frequently in backbone routers and thus update operations must be reasonably fast as well.
• Large forwarding tables are usually represented as tries so that the complexity of lookup operations depends on the distribution of the length of network prefixes and not on the total number of table entries [25]. A trie is a tree-based data structure allowing the organization of prefixes on a digital basis by using the bits of prefixes to direct the branching.
• The usage of optimized tree representations, usually implemented in hardware, provides the performance that is needed to handle IP on very high speed links.

See [26] for a good survey on fast IP address lookup algorithms.

3.2.4 IPv4 Error Handling (ICMPv4)

The Internet Control Message Protocol (ICMP) as specified in RFC 792 [27] is used to inform nodes about problems encountered while forwarding IP packets. It also introduces messages which can be used to perform simple tests. ICMP messages are transported in the payload of ordinary IP packets.

In the following, a selection of ICMP message formats will be discussed. ICMP messages in general contain a checksum which is computed over the ICMP message in order to detect some bit errors in ICMP messages.

Echo Request/Reply

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Type      |     Code      |          Checksum             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Identifier          |        Sequence Number        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Data ...
   +-+-+-+-+-

• The ICMP echo request message (type = 8, code = 0) asks the destination node to return an echo reply message (type = 0, code = 0) to the sender of the echo request message.
• The Identifier and Sequence Number fields are used by the sender to correlate incoming replies with previously sent requests.
• The data field may contain additional data or just fill bytes in order to bring the IP packets to a certain size.

Unreachable Destinations

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Type      |     Code      |          Checksum             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             unused                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Internet Header + 64 bits of Original Data Datagram      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Type field has the value 3 for all unreachable destination messages.
• The Code field indicates why a certain destination is not reachable:

      0   Net Unreachable
      1   Host Unreachable
      2   Protocol Unreachable
      3   Port Unreachable
      4   Fragmentation Needed and Don’t Fragment was Set
      5   Source Route Failed
      6   Destination Network Unknown
      7   Destination Host Unknown
      8   Source Host Isolated
      9   Communication with Destination Network is Administratively Prohibited
     10   Communication with Destination Host is Administratively Prohibited
     11   Destination Network Unreachable for Type of Service
     12   Destination Host Unreachable for Type of Service
     13   Communication Administratively Prohibited
     14   Host Precedence Violation
     15   Precedence cutoff in effect

• The data field contains the beginning of the packet which caused the ICMP unreachable destination message.

Redirect

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Type      |     Code      |          Checksum             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Router Internet Address                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Internet Header + 64 bits of Original Data Datagram      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Type field has the value 5 for all redirect messages.
• The Code field indicates which type of packets should be redirected:

      0   Redirect datagrams for the Network.
      1   Redirect datagrams for the Host.
      2   Redirect datagrams for the Type of Service and Network.
      3   Redirect datagrams for the Type of Service and Host.

• The Router Internet Address field contains the IP address of the router to which packets should be redirected.
• The data field contains the beginning of the packet which caused the ICMP redirect message.

3.2.5 MTU Path Discovery

Fragmentation of IPv4 packets is problematic for several reasons [28]:

• The receiver must buffer fragments until all fragments have been received. However, it is not useful to keep fragments in a buffer indefinitely. Hence, the TTL field of all buffered packets will be decremented once per second and fragments are dropped when the TTL field becomes zero.
• The loss of a fragment causes in most cases the sender to resend the original IP packet, which in most cases gets fragmented as well. Hence, the probability of transmitting a large IP packet successfully goes down quickly if the loss rate of the network goes up.
• Since the Identification field identifies fragments that belong together and the number space is limited, one cannot fragment an arbitrarily large number of packets.

An obvious solution for the problem is to cause the sender never to generate packets that are larger than the path MTU and thus never have to be fragmented [29].
To make this simple solution work, the sender has to be able to learn the path MTU:

• The sender sends IPv4 packets with the DF flag turned on.
• A router which would have to fragment a packet with the DF flag turned on drops the packet and sends an ICMP unreachable destination message (code 4) back to the sender, which also includes the MTU of the local outgoing link.
• Upon receiving the ICMP message, the sender adapts its estimate of the path MTU and retries.
• Since the path MTU can change dynamically (since the path can change), a once learned path MTU should be verified and adjusted periodically.
• Not all routers necessarily report the local link MTU. In this case, the sender usually tries typical MTU values, which is usually faster than doing a binary search.

3.2.6 IPv4 over IEEE 802.3

IPv4 packets are sent in the payload of IEEE 802.3 frames according to the specification in RFC 894 [30].

• IPv4 packets are identified by the value 0x800 in the IEEE 802.3 type field.
• According to the maximum length of IEEE 802.3 frames, the maximum link MTU is 1500 bytes.
• The mapping of IPv4 addresses to IEEE 802.3 addresses is table driven. Entries in so called mapping tables (sometimes also called address translation tables) can either be statically configured or dynamically learned.

3.2.7 IPv4 Address Translation (ARP, RARP)

The Address Resolution Protocol (ARP) defined in RFC 826 [31] allows an IP node to determine the link-layer address of a neighboring node on a broadcast network. The fundamental principle here is to broadcast a message asking for the translation of an IP address to a link-layer address to all stations attached to a broadcast network. Since the message is broadcast, it will also reach the node which has the IP address assigned to one of its interfaces. This node can thus respond by sending a unicast message back to the node which asked the question. Subsequently, an extension was defined which allows reverse address resolution to be performed.
The Reverse Address Resolution Protocol (RARP) defined in RFC 903 [32] resolves a node’s hardware address to an IP address.

In case of IPv4 addresses and IEEE 802.3 addresses, the following message format is used for both ARP and RARP:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Hardware Type         |         Protocol Type         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     HLEN      |     PLEN      |           Operation           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Sender Hardware Address (SHA)                 =
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   = Sender Hardware Address (SHA) |    Sender IP Address (SIP)    =
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   =    Sender IP Address (SIP)    | Target Hardware Address (THA) =
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   =                 Target Hardware Address (THA)                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Target IP Address                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The ARP message format is not aligned to 32-bit word boundaries in case of IPv4 addresses and IEEE 802 MAC addresses.
• The Hardware Type field identifies the address type used on the link-layer (the value 1 is used for IEEE 802.3 MAC addresses).
• The Protocol Type field identifies the network layer address type (the value 0x800 is used for IPv4).
• ARP/RARP packets use the type value 0x806 in the IEEE 802.3 frame.
• The Operation field contains the message type: ARP Request (1), ARP Response (2), RARP Request (3), RARP Response (4).
• The sender fills, depending on the request type, either the Target Hardware Address (RARP) field or the Target IP Address (ARP) field.
• The responding node swaps the Sender/Target fields and fills the empty fields with the requested information.
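The message layout and the swap performed by the responding node can be sketched in Python with the standard struct module. This is a minimal sketch under stated assumptions: the names ARP_FORMAT, build_arp_request and build_arp_reply are made up for this example, only the IPv4 over IEEE 802 MAC case (HLEN = 6, PLEN = 4) is covered, and the message is only built in memory, not sent on a link.

```python
import socket
import struct

# Network byte order, no padding: Hardware Type, Protocol Type, HLEN, PLEN,
# Operation, SHA (6 bytes), SIP (4 bytes), THA (6 bytes), TIP (4 bytes).
ARP_FORMAT = "!HHBBH6s4s6s4s"   # 28 bytes in total

ARP_REQUEST, ARP_RESPONSE = 1, 2


def build_arp_request(sender_mac, sender_ip, target_ip):
    """Build an ARP request asking for the hardware address of target_ip."""
    return struct.pack(
        ARP_FORMAT,
        1,                            # Hardware Type: IEEE 802.3 MAC addresses
        0x0800,                       # Protocol Type: IPv4
        6, 4,                         # HLEN, PLEN
        ARP_REQUEST,
        sender_mac,
        socket.inet_aton(sender_ip),
        b"\x00" * 6,                  # Target Hardware Address is still unknown
        socket.inet_aton(target_ip),
    )


def build_arp_reply(request, responder_mac):
    """Swap the sender/target fields of a request and fill in the responder's MAC."""
    htype, ptype, hlen, plen, _op, sha, sip, _tha, tip = struct.unpack(ARP_FORMAT, request)
    return struct.pack(
        ARP_FORMAT, htype, ptype, hlen, plen, ARP_RESPONSE,
        responder_mac, tip,           # the responder becomes the sender
        sha, sip,                     # the original sender becomes the target
    )


if __name__ == "__main__":
    request = build_arp_request(bytes.fromhex("020000000001"), "192.0.2.1", "192.0.2.2")
    reply = build_arp_reply(request, bytes.fromhex("020000000002"))
    print(len(request), request.hex())
    print(len(reply), reply.hex())
```

The MAC and IP addresses in the usage part are illustrative values (locally administered MAC addresses and documentation IPv4 addresses).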
3.2.8 Automatic Configuration (DHCP)

The Dynamic Host Configuration Protocol (DHCP) defined in RFC 2131 [33] allows nodes (DHCP clients) to retrieve configuration parameters dynamically from a central configuration server (DHCP server). A binding is a collection of configuration parameters, including at least an IP address, associated with or bound to a DHCP client. Bindings are managed by DHCP servers.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     op (1)    |   htype (1)   |   hlen (1)    |   hops (1)    |
   +---------------+---------------+---------------+---------------+
   |                   xid (4) - transaction id                    |
   +-------------------------------+-------------------------------+
   |           secs (2)            |           flags (2)           |
   +-------------------------------+-------------------------------+
   |               ciaddr (4) - client IPv4 address                |
   +---------------------------------------------------------------+
   |            yiaddr (4) - your (client) IPv4 address            |
   +---------------------------------------------------------------+
   |             siaddr (4) - next server IPv4 address             |
   +---------------------------------------------------------------+
   |             giaddr (4) - relay agent IPv4 address             |
   +---------------------------------------------------------------+
   |                                                               |
   |           chaddr (16) - client hardware address of            |
   |                  type htype and length hlen                   |
   |                                                               |
   +---------------------------------------------------------------+
   |                                                               |
   |          sname (64) - server name (null terminated)           |
   |                                                               |
   +---------------------------------------------------------------+
   |                                                               |
   |         file (128) - boot file name (null terminated)         |
   |                                                               |
   +---------------------------------------------------------------+
   |                                                               |
   |                      options (variable)                       |
   +---------------------------------------------------------------+

The DHCP protocol supports the following message types:

• The DHCPDISCOVER message is a broadcast message which is sent by DHCP clients to locate DHCP servers.
• The DHCPOFFER message is sent from a DHCP server to offer a client a set of configuration parameters.
• The DHCPREQUEST message is sent from the client to a DHCP server as a response to a previous DHCPOFFER message, to verify a previously allocated binding or to extend the lease of a binding.
• The DHCPACK message is sent by a DHCP server with some additional parameters to the client as a positive acknowledgement to a DHCPREQUEST.
• The DHCPNAK message is sent by a DHCP server to indicate that the client’s notion of a configuration binding is incorrect.
• The DHCPDECLINE message is sent by a DHCP client to indicate that parameters are already in use.
• The DHCPRELEASE message is sent by a DHCP client to inform the DHCP server that configuration parameters are no longer used.
• The DHCPINFORM message is sent from the DHCP client to inform the DHCP server that only local configuration parameters are needed.

A typical exchange between a client and two candidate servers is displayed below.

                Server          Client          Server
            (not selected)                    (selected)

                  v               v               v
                  |               |               |
                  |     Begins initialization     |
                  |               |               |
                  | _____________/|\____________  |
                  |/DHCPDISCOVER  | DHCPDISCOVER \|
                  |               |               |
              Determines          |          Determines
             configuration        |         configuration
                  |               |               |
                  |\              |  ____________/|
                  | \________     | /DHCPOFFER    |
                  | DHCPOFFER\    |/              |
                  |           \   |               |
                  |       Collects replies        |
                  |             \ |               |
                  |     Selects configuration     |
                  |               |               |
                  | _____________/|\____________  |
                  |/ DHCPREQUEST  |  DHCPREQUEST\ |
                  |               |               |
                  |               |     Commits configuration
                  |               |               |
                  |               | _____________/|
                  |               |/ DHCPACK      |
                  |               |               |
                  |    Initialization complete    |
                  |               |               |
                  .               .               .
                  .               .               .
                  |               |               |
                  |      Graceful shutdown        |
                  |               |               |
                  |               |\ ____________ |
                  |               | DHCPRELEASE  \|
                  |               |               |
                  |               |        Discards lease
                  |               |               |
                  v               v               v

See RFC 2131 [33] for a complete state diagram and a complete description of the possible transitions.

The options field of DHCP messages can contain various configuration options. Options may have a fixed length or a variable length.
All options begin with a tag octet, usually followed by a length field and the actual value (tag-length-value, TLV). An initial set of options such as options to configure a list of routers, a list of name servers and so on is defined in RFC 2132 [34].

Some security aspects related to the lack of authentication within DHCP are discussed in RFC 3118 [35] and a proposal is made to provide delayed authentication, which is still subject to denial of service attacks.

3.3 Internet Protocol Version 6 (IPv6)

In the early 1990s, it became clear that the IPv4 address space is not large enough to support the expected growth of the Internet. Work started to define version 6 of the IP protocol (IPv6). The current core IPv6 specification was published in 1998 in RFC 2460 [17] as Draft Standard. Implementations of IPv6 are available on almost all platforms.

The primary goals driving the development of IPv6 were:

• Increase of the address space from 32 bits to 128 bits.
• Simplification of protocol headers for the most common cases to reduce processing costs and bandwidth consumption in the normal cases.
• Improved support for protocol extensions and options.
• Capability to mark packets that belong to particular traffic flows for which the sender requests special handling.
• Authentication and privacy capabilities to support authentication, data integrity and optional data confidentiality of IPv6 packets.
• Integrated automatic end-system configuration capabilities.

3.3.1 IPv6 Addresses

IPv6 addresses are 128-bit identifiers for interfaces and sets of interfaces. The details are defined in RFC 3513 [8]. There are three types of IPv6 addresses:

• A unicast address is an identifier for a single interface. A packet sent to a unicast address is delivered to the interface identified by that address.
• An anycast address is an identifier for a set of interfaces.
A packet sent to an anycast address is delivered to one of the interfaces identified by that address.
• A multicast address is an identifier for a set of interfaces. A packet sent to a multicast address is delivered to all interfaces identified by that address.

The type of an IPv6 address is identified by the high-order bits of the address:

   Address type          Binary prefix        IPv6 notation
   ----------------------------------------------------------
   Unspecified           00...0 (128 bits)    ::/128
   Loopback              00...1 (128 bits)    ::1/128
   Multicast             11111111             FF00::/8
   Link-local unicast    1111111010           FE80::/10
   Site-local unicast    1111111011           FEC0::/10
   Global unicast        (everything else)

   Table 3.1: IPv6 address type identification

Anycast addresses are taken from the unicast address spaces (of any scope) and are not syntactically distinguishable from unicast addresses.

Interface Identifiers

Interface identifiers in IPv6 unicast addresses are used to uniquely identify interfaces on a link. For all unicast addresses, except those that start with binary 000, interface identifiers are required to be 64 bits long and to be constructed in modified EUI-64 format. The modified EUI-64 format can be obtained from IEEE 802 MAC addresses by inserting two octets with the hexadecimal values 0xFF and 0xFE in the middle of the 48-bit MAC address and by inverting the universal/local bit.

A 48-bit IEEE MAC address with global scope has the following format:

   |0              1|1              3|3              4|
   |0              5|6              1|2              7|
   +----------------+----------------+----------------+
   |cccccc0gcccccccc|ccccccccmmmmmmmm|mmmmmmmmmmmmmmmm|
   +----------------+----------------+----------------+

The c bits are the assigned company identification, 0 is the universal/local bit to indicate global scope, the g bit is the individual/group bit, and the m bits are the manufacturer selected extension identifier.
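The construction of an interface identifier from a MAC address can be sketched in a few lines of Python. This is a minimal sketch: the function names mac_to_modified_eui64 and link_local_address are made up for this example, the MAC address is assumed to be given in the usual colon-separated hexadecimal notation, and the example MAC address is an arbitrary illustrative value. Besides inserting the octets 0xFF and 0xFE, the modified EUI-64 format inverts the universal/local bit, which is the second least significant bit of the first octet.

```python
import ipaddress


def mac_to_modified_eui64(mac):
    """Convert a 48-bit MAC address (aa:bb:cc:dd:ee:ff) into the
    64-bit modified EUI-64 interface identifier."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    octets[0] ^= 0x02                       # invert the universal/local bit
    # Insert 0xFF 0xFE between the company id and the extension identifier.
    return bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])


def link_local_address(mac):
    """Combine the link-local prefix FE80::/10 with the interface identifier."""
    interface_id = int.from_bytes(mac_to_modified_eui64(mac), "big")
    prefix = int(ipaddress.IPv6Address("fe80::"))
    return ipaddress.IPv6Address(prefix | interface_id)


if __name__ == "__main__":
    mac = "00:11:22:33:44:55"               # illustrative value
    print(mac_to_modified_eui64(mac).hex()) # 021122fffe334455
    print(link_local_address(mac))          # fe80::211:22ff:fe33:4455
```

The second function combines the identifier with the link-local prefix FE80::/10 listed in Table 3.1, with the 54 bits between prefix and interface identifier set to zero.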
The corresponding modified EUI-64 identifier has the following format:

   |0              1|1              3|3              4|4              6|
   |0              5|6              1|2              7|8              3|
   +----------------+----------------+----------------+----------------+
   |cccccc1gcccccccc|cccccccc11111111|11111110mmmmmmmm|mmmmmmmmmmmmmmmm|
   +----------------+----------------+----------------+----------------+

With this transformation, it is possible to compute a link-local IPv6 address for each physical IEEE 802 MAC interface.

While the automatic computation of interface identifiers from MAC addresses is a simple way to construct link-local and global IPv6 addresses, some people have concerns that these IPv6 addresses can be used to track mobile nodes used in different networks. There are two approaches to address this concern: The first approach is to use DHCP instead of IPv6 auto-configuration to assign IPv6 addresses. The other approach, documented in RFC 3041 [36], is to generate a pseudo-random sequence of interface identifiers via a one-way hash function which depends on a random component and the globally unique interface identifier (where available). The pseudo-random interface identifiers are then only used for a certain period of time.

Global Unicast Addresses

The general format for IPv6 global unicast addresses is as follows:

   |         n bits         |   m bits  |       128-n-m bits         |
   +------------------------+-----------+----------------------------+
   | global routing prefix  | subnet ID |       interface ID         |
   +------------------------+-----------+----------------------------+

The global routing prefix is a typically hierarchically structured value assigned to a site (a cluster of subnets/links); the subnet ID is an identifier of a link within the site.

IPv6 Addresses with Embedded IPv4 Addresses

There is a special IPv6 address space which contains the complete IPv4 address space. The so called mapped IPv4 addresses were invented to make the transition from IPv4 to IPv6 networks easier. There is an ongoing controversy whether this is actually the case.
   |                80 bits               | 16 |      32 bits        |
   +--------------------------------------+--------------------------+
   |0000..............................0000|0000|    IPv4 address     |
   +--------------------------------------+----+---------------------+

Link-Local Unicast Addresses

Link-local unicast addresses are assigned automatically and guaranteed to be unique on the link attached to an interface.

   |   10     |
   |  bits    |         54 bits         |          64 bits           |
   +----------+-------------------------+----------------------------+
   |1111111010|           0             |       interface ID         |
   +----------+-------------------------+----------------------------+

3.3.2 IPv6 Packet Format

IPv6 packets have the following structure, as specified in RFC 2460 [17]:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version| Traffic Class |           Flow Label                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Payload Length        |  Next Header  |   Hop Limit   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                                                               +
   |                                                               |
   +                         Source Address                        +
   |                                                               |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                                                               +
   |                                                               |
   +                      Destination Address                      +
   |                                                               |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Version field contains the version number 6.
• The Traffic Class field contains in the current interpretation the Differentiated Services Code Point (DSCP) as well as two bits for explicit congestion notification [21, 22, 23].

        0     1     2     3     4     5     6     7
     +-----+-----+-----+-----+-----+-----+-----+-----+
     |          DS FIELD, DSCP           | ECN FIELD |
     +-----+-----+-----+-----+-----+-----+-----+-----+

• The Flow Label field can be used to mark packets transmitted from a source address to a destination address which belong to a certain traffic flow (e.g., all packets that belong to a certain voice call). The motivation for this field is that routers can handle packets that belong to a certain flow in a specific way.
• The Payload Length field contains the length of the payload following the IPv6 protocol header. Note that this field is different from the IPv4 Total Length field, which includes the protocol header.
• The Next Header field identifies the type of the payload following the header. This is roughly equivalent to the IPv4 Protocol field. Note, however, that IPv6 uses a daisy chain of extension headers to realize IPv6 options as discussed below.
• The Hop Limit field is used to limit the lifetime of IPv6 packets. Every router which forwards an IPv6 packet decrements this field and the packet is discarded if the value reaches zero.
• The Source Address and Destination Address fields contain the 128-bit source and destination addresses.

3.3.3 IPv6 Extensions

Compared to the IPv4 packet format, the IPv6 packet format is much simpler. This has been achieved by moving some functionality into so called extension headers which can be carried in a daisy chain between the IPv6 protocol header and the actual payload.

If a node does not understand an extension header, it has to discard the whole packet. Parameters which can be ignored by implementations are called options and they are carried in special extension headers.

Routing Extension Header

The Routing Header (RH) is an extension header that can be used by the sender to specify one or more nodes that must be visited on the way to the destination. The RH extension header as defined in RFC 2460 [17] has the following format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Next Header  |  Hdr Ext Len  |  Routing Type | Segments Left |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   .                                                               .
   .                      Type-Specific Data                       .
   .                                                               .
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Next Header field identifies the type of the payload following the RH extension header.
• The Hdr Ext Len field contains the length of the RH counted in 64-bit words minus 1.
• The Routing Type field identifies a certain variant of the RH and the semantics of the field Type-Specific Data. At the time of this writing, only a single routing type has been defined.
• The Segments Left field indicates the number of remaining routing segments.
• The contents of the Type-Specific Data field depends on the value of the Routing Type field. This field contains, under the currently defined Routing Type, 32 unused bits followed by a sequence of 128-bit fields where each 128-bit field contains an IPv6 address.

When an IPv6 packet reaches the destination and if there are remaining segments, then the next routing address is copied into the destination address field, the number of remaining segments is decremented and the packet is forwarded to the new destination address.

Fragment Extension Header

IPv6 assumes that every link has a link MTU of at least 1280 bytes [17]. Links that only support smaller MTUs must provide fragmentation and reassembly services below the IPv6 layer. Simple IPv6 implementations which do not perform MTU path discovery must restrict themselves to packets which do not exceed 1280 bytes.

Packets which are bigger than the path MTU can be fragmented by using the Fragment Header (FH) extension. Only IPv6 source nodes are allowed to fragment IPv6 packets. In contrast to IPv4, routers are not allowed to fragment packets. The FH extension header as defined in RFC 2460 [17] has the following format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Next Header  |   Reserved    |      Fragment Offset    |Res|M|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Identification                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Next Header field identifies the type of the payload following the FH extension header.
• The Fragment Offset field defines the relative position of the fragment (counted in 64-bit words) in the original IPv6 packet.
• The flag M is set if more fragments follow. The bits Res are currently unused and reserved.
• The Identification field contains the same value for all fragments of an IPv6 packet.

Authentication Extension Header

The Authentication Header (AH) extension header is used to provide data origin authentication, data integrity and replay protection services for IPv6 packets. The AH extension header as defined in RFC 2402 [37] has the following format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Next Header  |  Payload Len  |          RESERVED             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                Security Parameters Index (SPI)                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                Authentication Data (variable)                 |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Next Header field identifies the type of the payload following the AH extension header.
• The Payload Len field contains the length of the AH extension header counted in 32-bit words minus 2.
• The Security Parameters Index field contains a value which together with the destination address identifies a so called Security Association (SA).
The SA is basically a data structure which maintains all the necessary cryptographic information.
• The Sequence Number field contains a monotonically increasing sequence number. The first packet which is sent after establishing an SA has the sequence number 1. If the sequence number reaches 2^32, a new SA has to be established.
• The Authentication Data field contains an integrity check value (ICV). The length of this field depends on the authentication function in use (which is determined by the SA).

Encapsulating Security Payload Extension Header

The Encapsulating Security Payload (ESP) extension header realizes security services such as confidentiality, data origin authentication, data integrity, replay protection and limited traffic flow confidentiality. The ESP as defined in RFC 2406 [38] has the following format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ----
   |               Security Parameters Index (SPI)                 | ^Auth.
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |Cov-
   |                      Sequence Number                          | |erage
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ----
   |                    Payload Data* (variable)                   | |   ^
   ~                                                               ~ |   |
   |                                                               | |Conf.
   +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |Cov-
   |               |     Padding (0-255 bytes)                     | |erage*
   +-+-+-+-+-+-+-+-+               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |   |
   |                               |  Pad Length   | Next Header   | v   v
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ------
   |                 Authentication Data (variable)                |
   ~                                                               ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Security Parameters Index field contains a value which together with the destination address identifies a so called Security Association (SA). The SA is basically a data structure which maintains all the necessary cryptographic information.
• The Sequence Number field contains a monotonically increasing sequence number.
The first packet which is sent after establishing an SA has the sequence number 1. If the sequence number reaches 2^32, a new SA has to be established.
• The Payload Data field contains the encrypted payload (including any required initialization vectors).
• The Padding field can be used to align the payload to a certain desired length or to provide a certain size required by the encryption function. The Padding field can also be used to hide the original size of the actual payload.
• The Pad Length field contains the number of fill bytes.
• The Next Header field identifies the type of the payload.
• The Authentication Data field contains an integrity check value (ICV). The length of this field depends on the authentication function in use (which is determined by the SA).
• Fragmentation can only happen after encryption. It is not allowed to apply ESP on a fragment.

Hop-by-Hop Options Extension Header

The Hop-by-Hop Options (HO) extension header carries optional information that must be examined by every node along a packet's delivery path. The HO extension header as defined in RFC 2460 [17] has the following format:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Next Header  |  Hdr Ext Len  |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
|                                                               |
.                                                               .
.                            Options                            .
.                                                               .
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Next Header field identifies the type of the payload following the HO extension header.
• The Hdr Ext Len field contains the length of the HO extension header counted in 64-bit words minus 1.
• The Options field contains the list of options.
Each option is encoded as a tag-length-value (TLV) triple:

 0                   1
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- - - - - - - - -
|  Option Type  |  Opt Data Len |  Option Data
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- - - - - - - - -

The Option Type field identifies the option kind and the Opt Data Len field contains the length of the Option Data field counted in bytes. The options in an HO extension header are processed in the order in which they appear in the header.

Destination Options Extension Header

The Destination Options (DO) extension header carries optional information that must be processed by the final receiver of the packet. The DO extension header as defined in RFC 2460 [17] has the following format:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Next Header  |  Hdr Ext Len  |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
|                                                               |
.                                                               .
.                            Options                            .
.                                                               .
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The fields Next Header, Hdr Ext Len and Options have the same format and semantics as in the HO extension header.

3.3.4 IPv6 Forwarding

IPv6 packets are forwarded using the same longest prefix match algorithm that is used in IPv4 networks. However, IPv6 addresses have much longer prefixes, which allows better address aggregation and thereby reduces the number of forwarding table entries. On the other hand, due to the length of the prefixes, it is even more crucial to use an algorithm whose complexity does not depend on the number of entries in the forwarding table or on the average prefix length.

3.3.5 IPv6 Error Handling (ICMPv6)

The Internet Control Message Protocol Version 6 (ICMPv6) is an adapted version of the ICMPv4 protocol. It introduces a set of control messages which are needed to report errors, to run diagnostic tests, to autoconfigure IPv6 nodes and to resolve IPv6 addresses to link-layer addresses.
The ICMPv6 messages defined in RFC 2463 [39] have the following format:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                         Message Body                          +
|                                                               |

• The Type field identifies the type of an ICMPv6 message. ICMPv6 messages are categorized into error messages (Type 0-127) and informational messages (Type 128-255).

Type   Description               Reference
1      Destination Unreachable   RFC 2463
2      Packet Too Big            RFC 2463
3      Time Exceeded             RFC 2463
4      Parameter Problem         RFC 2463
128    Echo Request              RFC 2463
129    Echo Reply                RFC 2463
133    Router Solicitation       RFC 2461
134    Router Advertisement      RFC 2461
135    Neighbor Solicitation     RFC 2461
136    Neighbor Advertisement    RFC 2461
137    Redirect                  RFC 2461

Table 3.2: ICMPv6 message types

• The Code field contains a value which further discriminates the message type. The exact meaning of the Code value depends on the contents of the Type field.
• The Checksum field contains the Internet checksum computed over the ICMPv6 message and parts of the IPv6 protocol header.
• The contents of the Message Body depend on the ICMPv6 message type.

3.3.6 IPv6 over IEEE 802.3

IPv6 packets can be encapsulated into IEEE 802.3 frames and sent over IEEE 802.3 networks as defined in RFC 2464 [40]:

• Frames containing IPv6 packets are identified by the value 0x86dd in the IEEE 802.3 type field.
• The link MTU is 1500 bytes, which corresponds to the maximum payload size of an IEEE 802.3 frame.
• The mapping of IPv6 addresses to IEEE 802.3 addresses is table driven. Entries in so-called mapping tables (sometimes also called address translation tables) can either be statically configured or dynamically learned using neighbor discovery.
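As a small illustration of the first point, a receiver can recognize IPv6 traffic by inspecting the type field of a frame. The sketch below assumes an untagged frame (no IEEE 802.1Q VLAN tag), so that the type field sits directly after the two 6-byte MAC addresses; the function name is_ipv6_frame is hypothetical.

```python
import struct

ETHERTYPE_IPV6 = 0x86DD  # type value used for IPv6 packets (RFC 2464)
HEADER_LEN = 14          # destination MAC (6) + source MAC (6) + type (2)


def is_ipv6_frame(frame: bytes) -> bool:
    """Return True if an untagged frame carries an IPv6 packet.

    Assumption: the frame has no VLAN tag, so the type field is at
    byte offset 12.
    """
    if len(frame) < HEADER_LEN:
        return False
    (ethertype,) = struct.unpack("!H", frame[12:14])
    return ethertype == ETHERTYPE_IPV6
```

A frame whose type field holds 0x0800 (IPv4) would be rejected by this check, as would any frame shorter than the 14-byte header.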
3.3.7 IPv6 Neighbor Discovery

IPv6 supports the automatic configuration of hosts (autoconfiguration) and the discovery of neighbors attached to the same link. The Neighbor Discovery (ND) protocol, which is part of ICMPv6, simplifies the configuration of hosts and includes features that are realized by different protocols (ICMPv4, ARP) in IPv4. ND as documented in RFC 2461 [41] supports the following features:

• Discovery of the local routers that are attached to the same link (router discovery).
• Discovery of the prefixes used on a link so that it is possible to determine which IPv6 addresses can be reached directly (prefix discovery).
• Discovery of parameters such as the link MTU or the hop limit for outgoing packets (parameter discovery).
• Automatic configuration of IPv6 addresses (address autoconfiguration).
• Resolution of IPv6 addresses to link-layer addresses (address resolution).
• Determination of next-hop addresses for IPv6 destination addresses (next-hop determination).
• Detection of unreachable nodes which are attached to the same link (neighbor unreachability detection).
• Detection of conflicts that can arise during address generation (duplicate address detection).
• Discovery of better alternatives to forward packets (redirect).

The ND protocol uses some special IPv6 addresses:

• all-nodes: The link-local multicast address FF02::1 is used to reach all nodes connected to a link.
• all-routers: The link-local multicast address FF02::2 is used to reach all routers connected to a link.
• solicited-node: A link-local multicast address which is derived from the address of a node. It is formed by taking the low-order 24 bits of the node's address and appending those bits to the prefix FF02:0:0:0:0:1:FF00::/104.
• link-local: A link-local unicast address which in the case of IEEE 802 links can be derived from the IEEE 802 MAC address as discussed above.
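The derivation of the solicited-node address can be sketched in a few lines using Python's standard ipaddress module. The function name solicited_node and the example address are illustrative assumptions, not part of the ND specification.

```python
import ipaddress

# Prefix FF02:0:0:0:0:1:FF00::/104 with the low-order 24 bits set to zero.
SOLICITED_NODE_PREFIX = int(ipaddress.IPv6Address("ff02::1:ff00:0"))


def solicited_node(address: str) -> ipaddress.IPv6Address:
    """Return the solicited-node multicast address for an IPv6 address."""
    low24 = int(ipaddress.IPv6Address(address)) & 0xFFFFFF
    return ipaddress.IPv6Address(SOLICITED_NODE_PREFIX | low24)


print(solicited_node("fe80::211:22ff:fe33:4455"))  # ff02::1:ff33:4455
```

Because only 24 bits of the address are used, several addresses on a link can map to the same multicast group; the group merely narrows down the set of receivers.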
The ND protocol is realized as an extension of the ICMPv6 protocol and introduces five new message formats. To prevent some attacks on the ND protocol, the Hop Limit field of the IPv6 protocol header must be set to the value 255. Receivers of ND protocol messages must discard messages where the Hop Limit field does not contain the value 255. A packet can only arrive with a different value if it has been forwarded by a router, which indicates a potential attack from somewhere outside the link.

Router Solicitation

Hosts can ask routers attached to a link to generate router advertisements by sending a Router Solicitation (RS) message to the all-routers link-local multicast group. The format of the RS message is as follows:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Reserved                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Options ...
+-+-+-+-+-+-+-+-+-+-+-+-

• The Type field contains the value 133 and the Code field contains the value 0.
• The Checksum field contains the usual ICMPv6 checksum.
• The Options field may contain the link-layer address of the sender, if known.

Router Advertisement

Routers send Router Advertisement (RA) messages to the all-nodes multicast group, either periodically or as a reaction to an RS message.
The format of the RA message is as follows:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Cur Hop Limit |M|O|  Reserved |       Router Lifetime         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Reachable Time                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Retrans Timer                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Options ...
+-+-+-+-+-+-+-+-+-+-+-+-

• The Type field contains the value 134 and the Code field contains the value 0.
• The Checksum field contains the usual ICMPv6 checksum.
• The Cur Hop Limit field contains a proposed value which should be used by hosts in the Hop Limit field of outgoing IPv6 packets.
• The flag M indicates that hosts should in addition use another mechanism such as DHCPv6 for the autoconfiguration of addresses (managed address configuration).
• The flag O indicates that hosts should in addition use another mechanism such as DHCPv6 for the autoconfiguration of other parameters (other stateful configuration).
• The Router Lifetime field defines the time (in seconds) during which the advertised router may be used as a default router.
• The Reachable Time field defines the time (in milliseconds) during which a node assumes a neighbor is reachable after having received a reachability confirmation.
• The Retrans Timer field defines the time (in milliseconds) between retransmitted Neighbor Solicitation messages.
• The Options field may contain additional parameters such as the link-layer address of the sending router, the link MTU or information about the prefixes that are used on the link.
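To make the RA layout concrete, the following sketch decodes the fixed 16-byte part of an RA message with Python's struct module. The names RouterAdvertisement and parse_ra are hypothetical; the sketch does not verify the checksum and leaves the options undecoded.

```python
import struct
from typing import NamedTuple


class RouterAdvertisement(NamedTuple):
    cur_hop_limit: int
    managed: bool          # flag M
    other: bool            # flag O
    router_lifetime: int   # seconds
    reachable_time: int    # milliseconds
    retrans_timer: int     # milliseconds
    options: bytes         # raw, undecoded options


def parse_ra(message: bytes) -> RouterAdvertisement:
    """Decode the fixed part of an ICMPv6 Router Advertisement."""
    if len(message) < 16:
        raise ValueError("RA message too short")
    # Type, Code, Checksum, Cur Hop Limit, flags, Router Lifetime,
    # Reachable Time, Retrans Timer (all in network byte order).
    (icmp_type, code, _checksum, hop_limit, flags,
     lifetime, reachable, retrans) = struct.unpack("!BBHBBHII", message[:16])
    if icmp_type != 134 or code != 0:
        raise ValueError("not a Router Advertisement")
    return RouterAdvertisement(
        cur_hop_limit=hop_limit,
        managed=bool(flags & 0x80),   # M is the most significant flag bit
        other=bool(flags & 0x40),     # O is the next bit
        router_lifetime=lifetime,
        reachable_time=reachable,
        retrans_timer=retrans,
        options=message[16:],
    )
```

A real receiver would additionally check that the Hop Limit of the enclosing IPv6 packet is 255 before trusting the message, as required for all ND messages.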
Neighbor Solicitation

Nodes can ask other nodes attached to a link to generate neighbor advertisements by sending a Neighbor Solicitation (NS) message. For address resolution, the NS message is sent to the solicited-node multicast group of the target address; to verify the reachability of a known neighbor, it is sent to the neighbor's unicast address. The format of the NS message is as follows:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Reserved                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                       Target Address                          +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Options ...
+-+-+-+-+-+-+-+-+-+-+-+-

• The Type field contains the value 135 and the Code field contains the value 0.
• The Checksum field contains the usual ICMPv6 checksum.
• The Target Address field contains the address for which information is requested.
• The Options field may contain the link-layer address of the sender, if known.

Neighbor Advertisement

Nodes send a Neighbor Advertisement (NA) message as a reaction to a Neighbor Solicitation. Unsolicited NA messages can also be sent in order to propagate changes quickly. Solicited NA messages are sent to the IPv6 address of the requestor while unsolicited NA messages are sent to the all-nodes multicast group.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R|S|O|                     Reserved                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                       Target Address                          +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Options ...
+-+-+-+-+-+-+-+-+-+-+-+-

• The Type field contains the value 136 and the Code field contains the value 0.
• The Checksum field contains the usual ICMPv6 checksum.
• The flag R indicates that the sender is a router. The flag S indicates that the message is sent as a reaction to a Neighbor Solicitation message. The flag O indicates that the contained information should overwrite any existing cache entries.
• The Options field may contain the link-layer address of the sender, if known.

Redirect

Routers can generate Redirect (R) messages to inform hosts about better paths towards a given destination address.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Reserved                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                       Target Address                          +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                     Destination Address                       +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Options ...
+-+-+-+-+-+-+-+-+-+-+-+-

• The Type field contains the value 137 and the Code field contains the value 0.
• The Checksum field contains the usual ICMPv6 checksum.
• The Target Address field contains the IPv6 address of a router which provides a better path to the destination address.
• The Destination Address field contains the destination address which is being redirected.
• The Options field may contain the link-layer address of the Target Address, if known. In addition, the Options field should contain the beginning of the IPv6 packet which caused the generation of the Redirect message.

3.4 Routing Protocols

The forwarding of packets in the Internet is controlled by forwarding tables which are present on all nodes. Every node has a more or less limited view of the overall network topology and relies on the help of other routers to move packets closer to their destination.
In order to establish and maintain connectivity, it is necessary to update and synchronize these forwarding tables. While this can be done manually on small leaf networks, it is necessary to automate this process in larger networks and in core backbone networks. Routing protocols have been developed for exactly this purpose.

• For routing purposes, the Internet is divided into autonomous systems (ASs). An autonomous system (AS) is basically a set of routers and networks under the same administration.
• The routing protocol(s) within an AS are called Interior Gateway Protocols (IGPs). They are generally independent from the routing protocols used in other ASs. Widely used IGPs are the Routing Information Protocol (RIP) and the Open Shortest Path First (OSPF) protocol.
• The routing protocol(s) between ASs are called Exterior Gateway Protocols (EGPs). The currently most widely used EGP is the Border Gateway Protocol version 4 (BGP4).

3.4.1 Routing Information Protocol (RIP)

The Routing Information Protocol version 2 (RIP-2) defined in RFC 2453 [42] is a simple routing protocol to be used within ASs. It is based on the exchange of distance vectors and thus falls into the class of distance vector routing protocols. The foundation of this protocol is the Bellman-Ford algorithm for computing shortest paths in graphs.

Bellman-Ford Shortest Paths Algorithm

• Let G = (V, E) be a graph with the vertices V and the edges E with n = |V| and m = |E|.
• Let D be an n × n distance matrix in which D(i, j) denotes the distance from node i ∈ V to the node j ∈ V.
• Let H be an n × n matrix in which H(i, j) ∈ E denotes the edge on which node i ∈ V forwards a message to node j ∈ V.
• Let M be a vector with the link metrics, S a vector with the start nodes of the links and T a vector with the end nodes of the links.

1. Set D(i, j) = ∞ for i ≠ j and D(i, j) = 0 for i = j.
2. For all edges l ∈ E and for all nodes k ∈ V: Set i = S[l] and j = T[l] and d = M[l] + D(j, k).
3.
If d < D(i, k), set D(i, k) = d and H(i, k) = l.
4. Repeat from step 2 if at least one D(i, k) has changed. Otherwise, stop.

Properties

• Simple distance vector protocols like RIP have the property that good news propagates quickly while bad news propagates relatively slowly.
• In particular, the failure of links can lead to situations where the bad news propagates slowly by counting up the costs (count to infinity).
• RIP defines infinity to be 16 hops. Hence, RIP can only be used in networks where the longest path (the network diameter) is shorter than 16 hops.
• RIP uses the number of hops as the only metric.

Protocol

RIP-2 runs over the User Datagram Protocol (UDP) and normally uses port number 520. All RIP-2 messages have the following structure:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    Command    |    Version    |         must be zero          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                          RIP Entries                          ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Command field indicates whether the message is a request or a response. Response messages can also be sent without a previous request (unsolicited responses).
• The Version field contains the protocol version number.
• The RIP Entries field contains a list of fixed-size entries, the so-called RIP Entries.
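Returning briefly to the Bellman-Ford steps listed at the beginning of this section, the computation that underlies the distance vectors exchanged in these messages can be sketched as follows. This is a minimal, centralized sketch rather than the distributed exchange RIP performs; the names bellman_ford, start, end, metric, dist and hop are illustrative, with dist and hop playing the roles of the matrices D and H. Non-negative link metrics are assumed so that the loop terminates.

```python
import math


def bellman_ford(n, start, end, metric):
    """Compute distances between all node pairs by repeated relaxation.

    Nodes are numbered 0..n-1. Link l leads from start[l] to end[l]
    and has the cost metric[l]. Returns (dist, hop) where dist[i][k]
    is the distance from i to k and hop[i][k] is the index of the
    link on which node i forwards traffic towards k.
    """
    # Step 1: zero distance to oneself, infinite distance otherwise.
    dist = [[0 if i == k else math.inf for k in range(n)] for i in range(n)]
    hop = [[None] * n for _ in range(n)]

    changed = True
    while changed:                       # step 4: repeat until stable
        changed = False
        for l in range(len(metric)):     # step 2: every link ...
            i, j = start[l], end[l]
            for k in range(n):           # ... and every destination
                d = metric[l] + dist[j][k]
                if d < dist[i][k]:       # step 3: shorter path found
                    dist[i][k] = d
                    hop[i][k] = l
                    changed = True
    return dist, hop


# Triangle with cheap links 0-1 and 1-2 and an expensive link 0-2;
# every undirected link is entered once per direction.
start = [0, 1, 1, 2, 0, 2]
end = [1, 0, 2, 1, 2, 0]
metric = [1, 1, 1, 1, 5, 5]
dist, hop = bellman_ford(3, start, end, metric)
print(dist[0][2], hop[0][2])  # 2 0
```

In the example, node 0 reaches node 2 at cost 2 via link 0 (towards node 1) instead of using the direct link with cost 5.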
A RIP Entry has the following structure:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Address Family Identifier   |           Route Tag           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          IP Address                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Subnet Mask                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Next Hop                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Metric                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Address Family Identifier field identifies an address family. RIP was originally developed for networks with different address formats.
• The Route Tag field marks entries which contain external routes, which might have been established by an EGP.
• The IP Address field contains an IPv4 destination address.
• The Subnet Mask field indicates the network prefix.
• The Next Hop field contains the IPv4 address of the next hop router to which packets to the destination specified by this route entry should be forwarded. Specifying a value of 0.0.0.0 indicates that routing should be via the originator of the RIP advertisement.
• The Metric field contains a value between 1 and 15 inclusive. The value 16 is used when the destination is not reachable (infinity).

The first RIP Entry can have a special format to support authentication:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            0xFFFF             |      Authentication Type      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                        Authentication                         +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The constant 0xFFFF is used to distinguish an authentication entry from other entries.
• The Authentication Type field identifies an authentication scheme.
The RIP-2 specification only defines a simple cleartext password authentication scheme.
• The Authentication field contains data which is checked by the receiver to determine the authenticity of the message.
• RFC 2082 [43] defines an authentication scheme based on MD5 which uses an additional trailer at the end of a RIP-2 message.

3.4.2 Open Shortest Path First (OSPF)

The Open Shortest Path First (OSPF) protocol [44] is a routing protocol for use within autonomous systems. It is based on the idea that all nodes have access to the actual state of the links and the resulting topology (link state routing). Every node independently computes the shortest paths to all the other nodes by using Dijkstra's shortest path algorithm. The link state information is distributed by flooding.

Dijkstra's Shortest Paths Algorithm

1. All nodes are initially labeled with infinite costs indicating that the costs to reach the node are not yet known. The cost label is marked as tentative and will be updated as the algorithm proceeds.
2. The costs of the root node are set to 0 and the root node is marked as the current node.
3. The cost label attached to the current node is marked permanent.
4. All directly adjacent nodes are now considered in turn. For each adjacent node, the costs for reaching the node are calculated by taking the costs of the current node and adding the link costs for the link that connects the current node to the adjacent node. If the resulting sum is smaller than the cost label of the adjacent node, the cost label is updated with the new cost and the name of the current node.
5. If there are still nodes with tentative cost labels, a node with the smallest costs is selected as the new current node. Go to step 3 if a new current node was selected.
6. The shortest paths to a destination node can now be read by following the labels from the destination node towards the root.
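The six steps can be sketched as follows. This is a minimal sketch with illustrative names (dijkstra, path, links, cost, pred); it selects the next current node by a linear scan rather than a priority queue, and it assumes that every node appears as a key of links and that link costs are non-negative. As a small deviation from step 5, the loop also stops when all remaining tentative nodes are unreachable.

```python
import math


def dijkstra(links, root):
    """links maps each node to a dict {neighbor: link cost}.

    Returns (cost, pred): the cost label of every node and the name
    of the node from which that label was last updated.
    """
    cost = {node: math.inf for node in links}    # step 1
    pred = {node: None for node in links}
    tentative = set(links)
    cost[root] = 0                               # step 2
    current = root
    while current is not None:
        tentative.discard(current)               # step 3: now permanent
        for neighbor, link_cost in links[current].items():   # step 4
            candidate = cost[current] + link_cost
            if neighbor in tentative and candidate < cost[neighbor]:
                cost[neighbor] = candidate
                pred[neighbor] = current
        # Step 5: choose the cheapest reachable tentative node, if any.
        reachable = [n for n in tentative if cost[n] < math.inf]
        current = min(reachable, key=cost.get) if reachable else None
    return cost, pred


def path(pred, destination):
    """Step 6: follow the labels from the destination back to the root."""
    hops = []
    node = destination
    while node is not None:
        hops.append(node)
        node = pred[node]
    return hops[::-1]


links = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
cost, pred = dijkstra(links, "A")
print(cost["D"], path(pred, "D"))  # 4 ['A', 'B', 'C', 'D']
```

In an OSPF router, links would be derived from the link state database, and the first hop of each path would become the forwarding table entry for the destination.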
OSPF Areas

• An OSPF area is a group of networks within an autonomous system.
• The internal topology of an OSPF area is invisible to other OSPF areas. The routing within an area (intra-area routing) is constrained to that area.
• The OSPF areas are interconnected via the OSPF backbone area (OSPF area 0). A path from a source node within one area to a destination node in another area has three segments (inter-area routing):
  1. An intra-area path from the source to a so-called area border router.
  2. A path in the backbone area from the area border router of the source area to the area border router of the destination area.
  3. An intra-area path from the area border router of the destination area to the destination node.
• OSPF routers are classified according to their location in the OSPF topology:
  1. Internal Router: A router where all interfaces belong to the same OSPF area.
  2. Area Border Router: A router which connects multiple OSPF areas. An area border router has to be able to run the basic OSPF algorithm for all areas it is connected to.
  3. Backbone Router: A router that has an interface to the backbone area. Every area border router is automatically a backbone router.
  4. AS Boundary Router: A router that exchanges routing information with routers belonging to other autonomous systems.
• Stub Areas are OSPF areas with a single area border router. The routing in stub areas can be simplified by using default forwarding table entries, which significantly reduces the overhead.

Protocol

OSPF messages are carried directly in IP packets. The value of the Protocol field of the IPv4 header or the Next Header field of the IPv6 header is 89 for the OSPF protocol.
All OSPF messages have the same header:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Version #   |     Type      |         Packet Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Router ID                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Area ID                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |            AuType             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Authentication                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Authentication                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Version field contains the OSPF version number, currently 2.
• The Type field identifies the message type.
• The Packet Length field contains the length of the whole OSPF message counted in bytes.
• The Router ID field identifies the router that originated an OSPF message.
• The Area ID field identifies the OSPF area. The Area ID for the OSPF backbone is 0, often written in dotted quad notation as 0.0.0.0.
• The Checksum field contains the Internet checksum computed over the whole OSPF message without the authentication field.
• The AuType field identifies the type of authentication procedure in use.
• The Authentication field contains authentication data. The format of the authentication data depends on the authentication type.

Hello

The Hello protocol is used to test the status of links and the attached neighbors. The Hello protocol works differently on broadcast networks, non-broadcast multi-access networks and point-to-multipoint networks. On broadcast and non-broadcast multi-access networks, the Hello protocol selects a Designated Router and a Backup Designated Router.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Version #   |   Type = 1    |         Packet length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Router ID                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Area ID                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |            AuType             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Authentication                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Authentication                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Network Mask                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         HelloInterval         |    Options    |    Rtr Pri    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      RouterDeadInterval                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Designated Router                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Backup Designated Router                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Neighbor                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              ...                              |

• The first fields contain the normal OSPF message header, where the Type field has the value 1.
• The Network Mask field contains the network mask for the interface.
• The HelloInterval field contains the time interval in seconds between successive Hello messages.
• The Rtr Pri field contains the priority of the router, which is used for the election of the Designated Router and the Backup Designated Router. Routers with priority 0 do not take part in the election.
• The RouterDeadInterval field defines the time interval in seconds after which a router is considered to be no longer reachable.
• The Designated Router field contains the identity of the Designated Router, or 0 if no Designated Router is known yet.
• The Backup Designated Router field contains the identity of the Backup Designated Router, or 0 if no Backup Designated Router is known yet.
• The end of the message contains a list of Neighbor fields. Each Neighbor field indicates the identity of a router from which a Hello message has been received during the last RouterDeadInterval.
• A link is considered available if Hello messages can be exchanged in both directions. On direct connections (point-to-point links, virtual links), the exchange of the database can start as soon as the link has been recognized as available.
• On network connections (broadcast links, non-broadcast links), the Designated Router and the Backup Designated Router are determined first:
  1. Initially, a router stays passive for one RouterDeadInterval: it collects incoming Hello messages and generates its own Hello messages in which it does not stand for election. Afterwards, only those neighbors are considered for which the link is available in both directions.
  2. If one or more routers have offered themselves as Backup Designated Router, the router with the highest priority is selected. If the priority is not unique, the candidate with the largest identification number is selected.
  3. If no router has offered itself as Backup Designated Router, the router with the highest priority (and the largest identification number) is selected.
  4. If one or more routers have offered themselves as Designated Router, the router with the highest priority is selected. If the priority is not unique, the candidate with the largest identification number is selected.
  5. If no router has offered itself as Designated Router, the router with the highest priority (and the largest identification number) is selected.
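The selection in steps 2 to 5 can be sketched as follows. This is a simplified sketch of the steps as listed here, not of the complete election procedure in the OSPF specification. The names Neighbor, claims_dr, claims_bdr and elect are hypothetical, and the input is assumed to contain only routers (including the local one) whose link is available in both directions. Since one router cannot hold both roles, the sketch repeats the backup selection without the chosen Designated Router when both selections pick the same router; excluding that router is an interpretation, not something the steps spell out.

```python
from typing import NamedTuple


class Neighbor(NamedTuple):
    router_id: int
    priority: int
    claims_dr: bool = False    # offered itself as Designated Router
    claims_bdr: bool = False   # offered itself as Backup Designated Router


def _best(routers):
    """Highest priority wins; ties are broken by the largest router ID."""
    return max(routers, key=lambda r: (r.priority, r.router_id), default=None)


def elect(neighbors):
    """Return (designated router, backup designated router)."""
    # Routers with priority 0 do not take part in the election.
    eligible = [r for r in neighbors if r.priority > 0]

    def pick(claimed, exclude=None):
        pool = [r for r in eligible if r != exclude]
        claimants = [r for r in pool if claimed(r)]
        # Prefer routers that offered themselves, else fall back to all.
        return _best(claimants or pool)

    bdr = pick(lambda r: r.claims_bdr)     # steps 2 and 3
    dr = pick(lambda r: r.claims_dr)       # steps 4 and 5
    if dr is not None and dr == bdr:
        # One router cannot hold both roles: redo steps 2 and 3 without it.
        bdr = pick(lambda r: r.claims_bdr, exclude=dr)
    return dr, bdr
```

With three routers of which none has offered itself yet, for example elect([Neighbor(1, 1), Neighbor(2, 1), Neighbor(3, 0)]), router 2 becomes Designated Router, router 1 becomes Backup Designated Router, and router 3 is ignored because of its priority 0.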
A router cannot be Designated Router and Backup Designated Router at the same time. Therefore, steps 2 and 3 have to be repeated after step 5.

Exchange

The Exchange protocol is responsible for the initial synchronization of the database. ...

Flooding

...

3.4.3 Border Gateway Protocol (BGP)

Autonomous systems usually perform policy-based routing by using the Border Gateway Protocol version 4 (BGP4) as defined in RFC 1771 [45] to exchange reachability information between autonomous systems (ASs). The reachability information is sufficient to construct a graph of AS connectivity from which routing loops may be pruned and on which some policy decisions at the autonomous system level can be enforced.

BGP4 runs over the reliable transport protocol TCP, which eliminates the need for explicit fragmentation, retransmission, acknowledgement, and sequencing in BGP4 itself. BGP4 uses TCP port 179 for establishing connections between two BGP4 peers which are typically located in different ASs. When two ASs agree to exchange routing information, each AS must designate a router that will speak BGP4 on its behalf. These two routers are called BGP4 peers. The peers establish a TCP connection and run the BGP4 protocol, which basically has three phases:

1. The BGP4 peers exchange messages to open and confirm connection parameters.
2. The BGP4 peers initially exchange the entire BGP routing table. Incremental updates are sent as the routing tables change.
3. The BGP4 peers periodically exchange so-called keep-alive messages to ensure that the connection and the BGP4 peers are alive.
BGP4 Message Header

Each BGP4 message has a fixed-size header which may or may not be followed by a data portion:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                                                               +
|                            Marker                             |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Length             |     Type      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Marker field contains a value (initially all 1s) that the receiver of the BGP4 message can predict and verify. The Marker field can be used to detect loss of synchronization and to authenticate incoming BGP messages.
• The Length field indicates the total length of the message including the header, counted in bytes. The maximum length of BGP4 messages is 4096 bytes.
• The Type field indicates the type of the message. The following message types are defined:
  1. OPEN
  2. UPDATE
  3. NOTIFICATION
  4. KEEPALIVE

It is important to realize that BGP peers in general advertise only routes that should be seen from the outside. There might be additional possible routes which for policy reasons are not announced to other peers. Furthermore, it is important to realize that BGP only advertises routing information. The final decision which paths are selected by putting appropriate entries into the forwarding tables remains a local policy decision. For some analysis about the usage of BGP, the growth of BGP routing tables and the increase of AS numbers, see [46].

BGP4 Open Message

Once a TCP connection has been established between two BGP4 peers, they both send an OPEN message to communicate their AS number and to establish other parameters.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+
|    Version    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Autonomous System Number    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Hold Time           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        BGP Identifier                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Opt Parm Len  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                      Optional Parameters                      |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Version field contains the protocol version number.
• The Autonomous System Number field contains the 16-bit AS number of the sender.
• The Hold Time field specifies the maximum time that the receiver should wait for a response from the sender.
• The BGP Identifier field contains a 32-bit value which uniquely identifies the sender. This identifier is by definition selected from the IPv4 addresses of the sender.
• The Opt Parm Len field contains the total length of the Optional Parameters field or zero if no optional parameters are present.
• The Optional Parameters field contains a list of parameters. Each parameter is encoded using a tag-length-value (TLV) triple.

BGP4 Update Message

The UPDATE messages are used to transfer routing information between BGP peers. The information in the UPDATE packet can be used to construct a graph describing the relationships of the various Autonomous Systems.

An UPDATE message may simultaneously advertise a feasible route and withdraw multiple unfeasible routes from service. Hence, the UPDATE message consists of two parts:

1. The list of unfeasible routes that are being withdrawn.
2. The feasible route to advertise.
+-----------------------------------------------------+
|   Unfeasible Routes Length (2 octets)               |
+-----------------------------------------------------+
|   Withdrawn Routes (variable)                       |
+-----------------------------------------------------+
|   Total Path Attribute Length (2 octets)            |
+-----------------------------------------------------+
|   Path Attributes (variable)                        |
+-----------------------------------------------------+
|   Network Layer Reachability Information (variable) |
+-----------------------------------------------------+

• The Unfeasible Routes Length field indicates the total length of the Withdrawn Routes field counted in bytes. The value 0 indicates that no routes are being withdrawn.
• The Withdrawn Routes field contains a list of IPv4 address prefixes that are being withdrawn from service. Each IPv4 address prefix is encoded as a 2-tuple of the form (length, prefix) where the length indicates the prefix length in bits and prefix contains the IPv4 address prefix bits padded to the next byte boundary.
• The Total Path Attribute Length field indicates the total length of the Path Attributes field counted in bytes.
• The Path Attributes field contains a list of path attributes. Each attribute is encoded using a tag-length-value (TLV) triple. Path attributes convey information such as the origin of the path information (ORIGIN), the sequence of AS path segments (AS PATH), the IPv4 address of the border router that should be used as the next hop (NEXT HOP), or the local preference assigned by a BGP4 speaker (LOCAL PREF).
• The Network Layer Reachability Information field contains a list of IPv4 address prefixes to which the path attributes apply, that is, the destinations being advertised. Each IPv4 address prefix is encoded as a 2-tuple of the form (length, prefix) where the length indicates the prefix length in bits and prefix contains the IPv4 address prefix bits padded to the next byte boundary.
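The (length, prefix) encoding used by the Withdrawn Routes and Network Layer Reachability Information fields can be illustrated with a small Python sketch. The function names encode_prefix and decode_prefixes are hypothetical helpers introduced here for illustration; the sketch only covers the prefix encoding described above, not the rest of the UPDATE message.

```python
import ipaddress


def encode_prefix(prefix: str) -> bytes:
    """Encode an IPv4 prefix such as '10.0.0.0/8' as (length, prefix bits)."""
    net = ipaddress.ip_network(prefix)
    # Number of bytes needed to hold the prefix bits, padded to a byte boundary.
    nbytes = (net.prefixlen + 7) // 8
    return bytes([net.prefixlen]) + net.network_address.packed[:nbytes]


def decode_prefixes(data: bytes) -> list:
    """Decode a concatenated list of (length, prefix) tuples."""
    prefixes = []
    i = 0
    while i < len(data):
        plen = data[i]
        nbytes = (plen + 7) // 8
        raw = data[i + 1:i + 1 + nbytes]
        # Restore the trailing zero bytes that were omitted on the wire.
        addr = ipaddress.IPv4Address(raw + bytes(4 - nbytes))
        prefixes.append(f"{addr}/{plen}")
        i += 1 + nbytes
    return prefixes


encoded = encode_prefix("10.0.0.0/8") + encode_prefix("192.168.4.0/22")
print(encoded.hex())             # 080a16c0a804
print(decode_prefixes(encoded))  # ['10.0.0.0/8', '192.168.4.0/22']
```

Note that a /8 prefix occupies only two bytes on the wire (one length byte and one prefix byte), which is why a single UPDATE message can carry many prefixes.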
BGP4 Notification Message

BGP4 supports a NOTIFICATION message type used for control or when an error occurs. The transport connection is closed immediately after sending a NOTIFICATION message.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Error code   | Error subcode |             Data              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Error code field and the Error subcode field contain one of the combinations of error code and error subcode listed in Table 3.3.

Code  Description                  Subcode  Description
   1  Message Header Error               1  Connection Not Synchronized
   1  Message Header Error               2  Bad Message Length
   1  Message Header Error               3  Bad Message Type
   2  OPEN Message Error                 1  Unsupported Version Number
   2  OPEN Message Error                 2  Bad Peer AS
   2  OPEN Message Error                 3  Bad BGP Identifier
   2  OPEN Message Error                 4  Unsupported Optional Parameter
   2  OPEN Message Error                 5  Authentication Failure
   2  OPEN Message Error                 6  Unacceptable Hold Time
   3  UPDATE Message Error               1  Malformed Attribute List
   3  UPDATE Message Error               2  Unrecognized Well-known Attribute
   3  UPDATE Message Error               3  Missing Well-known Attribute
   3  UPDATE Message Error               4  Attribute Flags Error
   3  UPDATE Message Error               5  Attribute Length Error
   3  UPDATE Message Error               6  Invalid ORIGIN Attribute
   3  UPDATE Message Error               7  AS Routing Loop
   3  UPDATE Message Error               8  Invalid NEXT HOP Attribute
   3  UPDATE Message Error               9  Optional Attribute Error
   3  UPDATE Message Error              10  Invalid Network Field
   3  UPDATE Message Error              11  Malformed AS PATH
   4  Hold Timer Expired
   5  Finite State Machine Error
   6  Cease

Table 3.3: BGP4 error codes and error subcodes

BGP4 Keep Alive Message

BGP4 peers periodically exchange KEEPALIVE messages. A KEEPALIVE message consists of the standard BGP4 header with no additional data. The KEEPALIVE messages are needed to verify that shared state information is still present.
If a BGP4 peer does not receive a message within the Hold Time, then the peer will assume that there is a communication problem and tear down the connection.

Chapter 4

Internet Transport Layer

The transport layer is responsible for providing suitable transport services to application protocols. Some application protocols require a stream-based connection while others prefer a reliable datagram service and yet others are happy with a lightweight unreliable datagram service.

[Figure 4.1: Transport layer multiplexing/demultiplexing using port numbers. Application protocols such as SMTP, HTTP and FTP are reached at the transport layer via IP address plus port number; the IP layer is addressed by IP address only.]

• IP addresses are network layer endpoints and identify interfaces on nodes (hosts or routers) (see footnote 1). Network addresses have node-to-node significance.
• Transport layer endpoints identify communicating application processes and are represented in the Internet by a tuple consisting of an IP address and a 16-bit port number. Transport addresses have end-to-end significance.
• The number space for port numbers is divided into a port number range that can be used freely and a port number range which is managed by the Internet Assigned Numbers Authority (IANA). Well-known port numbers for standardized or frequently used protocols can be registered with IANA.
• Port numbers allow packets to be multiplexed and demultiplexed at the transport layer as shown in Figure 4.1.

Footnote 1: It is worth noting that IP addresses are typically used to identify (interfaces of) nodes as well as their location in the network. This dual role of IP addresses becomes interesting in the context of mobile devices.

There are currently four important transport protocols in the Internet:

1. The User Datagram Protocol (UDP) provides a simple unreliable best-effort datagram service.
2. The Transmission Control Protocol (TCP) provides a bidirectional, connection-oriented and reliable data stream.
3.
The Stream Control Transmission Protocol (SCTP) provides a reliable transport service supporting sequenced delivery of messages within multiple streams. SCTP maintains application protocol message boundaries (application protocol framing) and was designed to support signaling protocols.
4. The Real-Time Transport Protocol (RTP) provides a transport service for real-time multi-media applications where different data streams have to be synchronized. RTP is often implemented on top of UDP (and is thus, from a layering point of view, not a pure transport layer protocol).

4.1 Pseudo Header

Many Internet transport protocols contain an Internet checksum which is computed over the transport layer header and a so-called pseudo header which contains some selected and immutable fields of the IP header.

The IPv4 pseudo header consists of the IPv4 source and destination address plus the protocol number and the length of the transport layer message [47, 48].

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Source Address                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Destination Address                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  unused (0)   |   Protocol    |            Length             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The IPv6 pseudo header consists of the IPv6 source and destination address, the length of the transport layer message and the next header field value which identifies the transport protocol.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                        Source Address                         +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                      Destination Address                      +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Upper-Layer Packet Length                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     zero                      |  Next Header  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.2 User Datagram Protocol (UDP)

The User Datagram Protocol (UDP) is defined in RFC 768 [47] and provides a simple unreliable best-effort datagram service. The UDP protocol header basically extends the IP header with the source and destination port numbers and a checksum. UDP packets are identified by the value 17 in the IPv4 Protocol field or the IPv6 Next Header field. The UDP header has the following format:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Length             |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Source Port field contains the port number used by the sending application layer process.
• The Destination Port field contains the port number used by the receiving application layer process.
• The Length field contains the length of the UDP datagram including the UDP header counted in bytes.
• The Checksum field contains the Internet checksum computed over the pseudo header, the UDP header and the payload contained in the UDP packet.
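How the pseudo header enters the checksum computation can be seen in the following Python sketch for UDP over IPv4. The helper names (internet_checksum, pseudo_header, build_udp_datagram) and the example addresses are assumptions made for illustration. The rule that a computed checksum of zero is transmitted as all ones comes from RFC 768 and is not discussed in the text above.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement of the ones' complement sum of all 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad to a multiple of 16 bits for the computation
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:  # fold the carries back into the lower 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def pseudo_header(src_ip: str, dst_ip: str, length: int) -> bytes:
    """IPv4 pseudo header: addresses, zero byte, protocol 17 (UDP), length."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, 17, length))


def build_udp_datagram(src_ip, dst_ip, src_port, dst_port, payload: bytes) -> bytes:
    length = 8 + len(payload)  # the UDP header is 8 bytes long
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    csum = internet_checksum(pseudo_header(src_ip, dst_ip, length) + header + payload)
    if csum == 0:
        csum = 0xFFFF  # RFC 768: a computed zero is sent as all ones
    return struct.pack("!HHHH", src_port, dst_port, length, csum) + payload


dgram = build_udp_datagram("192.0.2.1", "192.0.2.2", 5000, 53, b"hello")
# A receiver recomputes the checksum over pseudo header and datagram;
# for an undamaged datagram the result is zero.
print(internet_checksum(pseudo_header("192.0.2.1", "192.0.2.2", len(dgram)) + dgram))
```

The pseudo header itself is never transmitted. Both sides construct it from the IP header, so a datagram delivered to the wrong address fails the check.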
4.3 Transmission Control Protocol (TCP)

The Transmission Control Protocol (TCP) is defined in RFC 793 [48] and provides a bidirectional, connection-oriented and reliable data stream over an unreliable connection-less network protocol. Applications exchange an unstructured byte stream and the TCP connection can be used in a bidirectional and a unidirectional mode. TCP provides end-to-end flow control using a windowing technique with adaptive timeouts and an automatic slow-down in congestion situations.

The data stream provided by an application is split into so-called segments for transmission. Every data segment is prefixed with a TCP header before it is sent as the payload of an IP packet. The TCP header has the following format:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Acknowledgment Number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Offset| Reserved  |   Flags   |            Window             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |         Urgent Pointer        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Source Port field contains the port number used by the sending application layer process.
• The Destination Port field contains the port number used by the receiving application layer process.
• The Sequence Number field contains, after connection establishment, the sequence number of the first data byte in the segment. During connection establishment, this field is used to establish the initial sequence number.
• The Acknowledgment Number field contains the next sequence number which the sender of the acknowledgement expects.
• The Offset field contains the length of the TCP header including any options, counted in 32-bit words.
• The Flags field contains a set of binary flags:
  – URG: Indicates that the Urgent Pointer field is significant.
  – ACK: Indicates that the Acknowledgment Number field is significant.
  – PSH: Data should be pushed to the application as quickly as possible.
  – RST: Reset of the connection.
  – SYN: Synchronization of sequence numbers.
  – FIN: No more data from the sender.
• The Window field indicates the number of data bytes which the sender of the segment is willing to receive.
• The Checksum field contains the Internet checksum computed over the pseudo header, the TCP header and the data contained in the TCP segment.
• The Urgent Pointer field points, relative to the sequence number of the current segment, to important data if the URG flag is set.
• The Options field can contain additional options.

4.3.1 Connection Establishment

Two communicating TCP protocol engines have to agree on a number of parameters. The connection establishment procedure establishes these parameters using a three-way handshake protocol. The handshake protocol guarantees correct connection establishment, even if TCP packets are lost or duplicated. In the normal case (no packet loss), the three-way handshake is performed as shown in Figure 4.2.

[Figure 4.2: TCP three-way connection establishment handshake. The active opener sends SYN x; the passive opener answers with ACK x+1, SYN y; the active opener answers with ACK y+1.]

• One of the TCP protocol engines first waits passively for incoming connections (passive open).
• The other TCP protocol engine actively initiates the connection establishment procedure (active open).
• The first TCP packet contains the initial sequence number in a SYN packet.
The initial sequence number is determined by a counter which is incremented roughly every 4 microseconds. This guarantees that the initial sequence number is not reused as long as old packets may still exist in the network.
• The passive TCP engine stores the received sequence number and sends its own randomly created initial sequence number, at the same time acknowledging the received SYN packet.
• The active TCP engine stores the received sequence number and acknowledges the receipt of this sequence number.

4.3.2 Connection Tear-down

TCP initially provides a bidirectional data stream after completing the connection establishment procedure. It is possible to turn the bidirectional connection into a unidirectional connection by closing one half of the connection. The TCP connection itself is terminated when both unidirectional connections have been closed. In the normal case (no packet loss), the connection tear-down is performed as shown in Figure 4.3.

[Figure 4.3: TCP connection teardown. The initiating engine sends FIN x; the other engine answers with ACK x+1 and later with ACK x+1, FIN y; the initiating engine answers with ACK y+1.]

• The connection tear-down procedure is started by a TCP protocol engine by setting the FIN flag.
• The receiver usually first acknowledges the receipt of the FIN packet.
• The receiving protocol engine then informs the application about the tear-down of the first half of the connection.
• Once the application indicates that it wants to close the other half of the connection, another TCP packet is transmitted in the other direction with the FIN flag set.
• The receiver of the second FIN packet acknowledges the receipt of the second FIN packet and the connection is closed.

In cases where a connection between two TCP engines is interrupted (e.g., a cable breaks or a node is turned off), the TCP specification requires a quiet time of 120 seconds (maximum segment lifetime, MSL) before new TCP connections can be established.
The quiet time is motivated by the time needed to ensure that packets belonging to the broken TCP connection have disappeared from the network.

4.3.3 State Machine

The various transitions possible during connection establishment and tear-down are best described by a finite state machine as shown below.

                              +---------+ ---------\      active OPEN
                              |  CLOSED |            \    -----------
                              +---------+<---------\   \   create TCB
                                |     ^              \   \  snd SYN
                   passive OPEN |     |   CLOSE        \   \
                   ------------ |     | ----------       \   \
                    create TCB  |     | delete TCB         \   \
                                V     |                      \   \
                              +---------+            CLOSE    |    \
                              |  LISTEN |          ---------- |     |
                              +---------+          delete TCB |     |
                   rcv SYN      |     |     SEND              |     |
                  -----------   |     |    -------            |     V
 +---------+      snd SYN,ACK  /       \   snd SYN          +---------+
 |         |<-----------------           ------------------>|         |
 |   SYN   |                    rcv SYN                     |   SYN   |
 |   RCVD  |<-----------------------------------------------|   SENT  |
 |         |                  snd SYN, ACK                  |         |
 |         |------------------           -------------------|         |
 +---------+   rcv ACK of SYN  \       /  rcv SYN,ACK       +---------+
   |           --------------   |     |   -----------
   |                  x         |     |     snd ACK
   |                            V     V
   |  CLOSE                   +---------+
   | -------                  |  ESTAB  |
   | snd FIN                  +---------+
   |                   CLOSE    |     |    rcv FIN
   V                  -------   |     |    -------
 +---------+          snd FIN  /       \   snd ACK          +---------+
 |  FIN    |<-----------------           ------------------>|  CLOSE  |
 | WAIT-1  |------------------                              |   WAIT  |
 +---------+          rcv FIN  \                            +---------+
   | rcv ACK of FIN   -------   |                            CLOSE  |
   | --------------   snd ACK   |                           ------- |
   V        x                   V                           snd FIN V
 +---------+                  +---------+                   +---------+
 |FINWAIT-2|                  | CLOSING |                   | LAST-ACK|
 +---------+                  +---------+                   +---------+
   |                rcv ACK of FIN |                 rcv ACK of FIN |
   |  rcv FIN       -------------- |    Timeout=2MSL -------------- |
   |  -------              x       V    ------------        x       V
    \ snd ACK                 +---------+delete TCB         +---------+
     ------------------------>|TIME WAIT|------------------>| CLOSED  |
                              +---------+                   +---------+

The TCP state machine has the states shown in Table 4.1.

CLOSED       Initial and final state
LISTEN       Wait for incoming connection requests (passive open)
SYN-RCVD     Received connection request (passive open)
SYN-SENT     Initiated connection establishment (active open)
ESTABLISHED  Connection is established and operational
FIN-WAIT-1   Started connection tear-down procedure
FIN-WAIT-2   Waiting for connection tear-down from remote end
TIME-WAIT    Waiting for remote engine to receive tear-down acknowledgement
CLOSING      Both engines close simultaneously
CLOSE-WAIT   Remote engine started connection tear-down procedure
LAST-ACK     Wait for last acknowledgement or until all segments have disappeared

Table 4.1: TCP protocol engine states

4.3.4 Flow Control

TCP uses a windowing approach to implement flow control. During connection establishment, both TCP engines advertise their buffer sizes. The available space left in the receiving buffer is advertised as part of the acknowledgements. Senders must not send more data than the advertised window allows, which protects the receiver's buffers. The only exception is the window size 0. If the window has reached a size of 0 bytes, the sender may still send data in the following two cases:

1. The sending application delivers urgent data that should be transmitted and delivered to the remote application as fast as possible.
2. The sender may send a 1-byte segment in order to make the receiver reannounce the next byte expected and the current window size. This is useful to protect against deadlocks that can otherwise occur if a window update is lost.

The following illustrative example shown in Figure 4.4 is taken from [2].

[Figure 4.4: TCP flow control. The sender writes 2K and sends a segment with SEQ = 0; the receiver, with a 4K buffer, answers with ACK = 2048, Win = 2048. The sender writes another 2K and sends a segment with SEQ = 2048; the receiver's buffer is now full and it answers with ACK = 4096, Win = 0. After the receiving application reads 2K, the receiver sends ACK = 4096, Win = 2048.]

Suppose the receiver has a 4096 byte buffer and the sender has 2048 bytes ready to send.
The sender will immediately transmit a 2048 byte segment (assuming that the path MTU is large enough). The receiver now fills half of the buffer and announces a new window size of 2048 bytes in the acknowledgement. Once the application has 2048 more bytes to send, another 2048 byte segment is transmitted. This now completely fills the receiver's buffer, which leads to an announcement of a window of size 0 in the following acknowledgement. Once the receiver's application process consumes data, another acknowledgement will be created which informs the sender of a new window size.

The TCP specification does not require that acknowledgements are created immediately for each received segment. This allows for optimizations where a receiver might choose to send just a single acknowledgement for several segments that have been received in quick succession. Furthermore, the receiving TCP engine might choose to delay the acknowledgement so that it can be piggybacked on other data sent from the receiver to the sender.

Nagle's Algorithm

The original TCP behaves rather inefficiently in situations where an application sends a stream of very small (one byte) payloads. In the extreme case, the sender sends a segment containing one byte of payload. The receiver responds with an acknowledgement for that single byte. Once the received byte has been copied to the application process, another acknowledgement is sent to advertise a new window size. The sending application may now have another byte to send and the process repeats.

Nagle suggested solving this problem by introducing the following rule: When data comes into the sender one byte at a time, just send the first byte and buffer all the rest until the byte in flight has been acknowledged.

This algorithm provides noticeable improvements especially for interactive traffic where a quickly typing user is connected over a rather slow network.
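Nagle's rule can be illustrated with a toy model in Python. The class NagleSender and its method names are hypothetical. The sketch ignores the maximum segment size, windows and timers, and it models only the rule as stated above: send the first data immediately, then buffer until the data in flight has been acknowledged.

```python
class NagleSender:
    """Toy model of Nagle's rule (not a real TCP implementation)."""

    def __init__(self):
        self.buffer = bytearray()  # data written by the application, not yet sent
        self.in_flight = False     # True while a sent segment is unacknowledged
        self.segments = []         # segments handed to the network, in order

    def write(self, data: bytes) -> None:
        """Called when the application hands data to TCP."""
        self.buffer += data
        self._try_send()

    def on_ack(self) -> None:
        """Called when the segment in flight has been acknowledged."""
        self.in_flight = False
        self._try_send()

    def _try_send(self) -> None:
        # Send only if there is data and nothing is waiting for an acknowledgement.
        if self.buffer and not self.in_flight:
            self.segments.append(bytes(self.buffer))
            self.buffer.clear()
            self.in_flight = True


sender = NagleSender()
for ch in b"hello":          # the application writes one byte at a time
    sender.write(bytes([ch]))
sender.on_ack()              # the acknowledgement for the first byte arrives
print(sender.segments)       # [b'h', b'ello']
```

Instead of five one-byte segments, the model sends two segments: the first byte and then everything that accumulated while the first byte was in flight.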
Clark's Algorithm

Another related problem is known as the silly window syndrome. This problem deals with applications on the receiving side that read the data one byte at a time from the receiver's buffer. The original TCP implementations immediately announced a window of one byte when the application removed a byte from the receive buffer. This acknowledgement then causes the transmission of another TCP segment which again contains just one byte of data.

Clark suggested solving this problem by preventing the receiver from sending a window update of 1 byte. Specifically, the receiver should not send a window update until it can handle the maximum segment size it advertised when the connection was established or until its buffer is half empty, whichever is smaller.

4.3.5 Congestion Control

A detailed discussion of the congestion control mechanism used by TCP can be found in RFC 2581 [49] while RFC 3390 [50] increases the size of the initial window. The following text is a rather short summary of these RFCs.

TCP's congestion control introduces the concept of a congestion window (cwnd) which defines how much data can be in transit. The congestion window is maintained by a TCP sender in addition to the flow control receiver window (rwnd) which is advertised by the receiver. The sender uses these two windows to limit the data that has been sent to the network and not yet acknowledged (flight size) to the minimum of the receiver window and the congestion window:

$flightsize \le \min(cwnd, rwnd)$

The key problem to be solved is the dynamic estimation of the congestion window. The solution adopted by TCP assumes that lost segments are indications of congestion. While this is true in most wired networks, this assumption does not work that well in wireless networks where the loss rate is much higher. Recent work also introduced explicit congestion notifications which can be used by a router in the network to indicate congestion without having to drop packets.
A TCP connection usually has different phases where different congestion control techniques should be used:

• After connection establishment, no suitable value for the congestion window is known. The solution is to define an initial window size (IW) and to start probing the network for the real congestion window using the slow start algorithm.
• Once a certain threshold, the so-called slow start threshold (ssthresh) for the congestion window, has been crossed, the connection enters the congestion avoidance phase in which the congestion window increases linearly.
• If a timeout occurs or congestion is signalled by other means, the slow start threshold ssthresh is reduced and the congestion window is set to the so-called loss window (LW) which is one full-sized segment. The sender now switches back to the slow start algorithm until ssthresh is crossed and congestion avoidance takes over.
• After a long period of idle time, the congestion window cwnd is usually not accurate anymore. Hence, the value of the congestion window must be set to the restart window (RW), which is typically the same as the initial window (IW), and the slow start algorithm is executed.

The initial window (IW) is usually initialized using the following formula:

$IW = \min(4 \cdot SMSS, \max(2 \cdot SMSS, 4380\ \text{bytes}))$

In this formula, SMSS is the sender maximum segment size, the size of the largest segment that the sender can transmit. The size does not include the TCP/IP headers and options.

During slow start, the congestion window cwnd increases by at most SMSS bytes for every acknowledgement received that acknowledges data. Note that this leads to an exponential increase if there are multiple segments acknowledged in the cwnd. Slow start ends when cwnd exceeds ssthresh or when congestion is observed. The initial value of ssthresh may be arbitrarily high. Some implementations use the size of the advertised window.
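The initial window formula and the exponential growth during slow start can be checked with a few lines of Python. The function names and the simplifying assumption of one acknowledgement per full-sized segment in flight per round trip are introduced here for illustration only; real implementations may acknowledge less often.

```python
def initial_window(smss: int) -> int:
    """IW = min(4 * SMSS, max(2 * SMSS, 4380 bytes))."""
    return min(4 * smss, max(2 * smss, 4380))


def slow_start(smss: int, ssthresh: int, rounds: int) -> list:
    """Return cwnd after each round trip while slow start is active.

    Assumption: every full-sized segment in flight is acknowledged
    individually, and each acknowledgement increases cwnd by SMSS bytes.
    """
    cwnd = initial_window(smss)
    history = [cwnd]
    for _ in range(rounds):
        if cwnd > ssthresh:  # slow start ends once cwnd exceeds ssthresh
            break
        acks = cwnd // smss
        cwnd += acks * smss
        history.append(cwnd)
    return history


print(initial_window(1460))          # 4380 (three full-sized segments)
print(initial_window(536))           # 2144 (four full-sized segments)
print(slow_start(1460, 65535, 10))   # [4380, 8760, 17520, 35040, 70080]
```

Under this assumption, cwnd doubles every round trip until it exceeds ssthresh, which is the exponential increase mentioned above.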
During congestion avoidance, cwnd is incremented by one full-sized segment per round-trip time (RTT). Congestion avoidance continues until congestion is detected. One formula commonly used to update cwnd during congestion avoidance is given by the following equation:

$cwnd = cwnd + SMSS \cdot SMSS / cwnd$

This adjustment is executed on every incoming non-duplicate ACK. The equation provides an acceptable approximation to the underlying principle of increasing cwnd by one full-sized segment per RTT.

When congestion is noticed during slow start or congestion avoidance (the retransmission timer expires), then the slow start threshold ssthresh is updated as follows:

$ssthresh = \max(flightsize / 2, 2 \cdot SMSS)$

The flight size is the amount of outstanding data in the network. The impact of slow start and congestion avoidance is summarized in Figure 4.5.

[Figure 4.5: TCP slow start and congestion avoidance. The plot shows the congestion window over the transmission number, with the ssthresh level and a timeout event marked.]

Fast Retransmit / Fast Recovery

To reduce the time for retransmissions, TCP receivers should send an immediate duplicate acknowledgement when an out-of-order segment arrives. The purpose of this acknowledgement is to inform the sender that a segment was received out-of-order and which sequence number is expected. In addition, a TCP receiver should send an immediate acknowledgement when the incoming segment fills in all or part of a gap in the sequence space.

TCP senders should use the fast retransmit algorithm to detect and repair loss. The fast retransmit algorithm uses the arrival of three duplicate acknowledgements (four identical acknowledgements without the arrival of any other intervening packets) as an indication that a segment has been lost.
After receiving three duplicate acknowledgements, TCP performs a retransmission of what appears to be the missing segment, without waiting for the retransmission timer to expire.

The fast recovery algorithm controls how the congestion window and the slow start threshold are updated when the fast retransmit algorithm is used. The basic idea is to not exercise the normal congestion reaction with a full slow start since acknowledgements are still flowing.

For details on how fast retransmit and fast recovery are implemented, see Section 3.2 in RFC 2581 [49].

4.3.6 Retransmission Timer

The retransmission timer controls when a segment is resent if no acknowledgement has been received. Setting the retransmission timer to a reasonable value is rather difficult in the context of TCP since the round-trip time varies and the mean and the variance of the round-trip time distribution can change rapidly within a few seconds as congestion builds up or is resolved.

The solution adopted in TCP is to measure the round-trip time constantly and to adjust the timeout interval. For each connection, TCP maintains an estimation of the current round-trip time in a variable called RTT. If an acknowledgement is received for a segment before the associated retransmission timer expires, the estimation of the round-trip time is updated as follows:

$RTT = \alpha \cdot RTT + (1 - \alpha) \cdot M$

The parameter α is a smoothing factor and typically set to α = 7/8. M is the measured round-trip time, that is, the time that has passed between sending the segment and receiving the corresponding acknowledgement.

In order to determine a good value for the retransmission timeout, the variance of the round-trip time distribution also has to be taken into account. TCP uses a cheap estimator for the standard deviation which is very efficient to compute:

$D = \alpha \cdot D + (1 - \alpha) \cdot |RTT - M|$

The parameter α in this formula is another smoothing factor which can be different from the smoothing factor used to estimate RTT.
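The two estimators above can be written down in a few lines of Python. The class name RttEstimator is hypothetical, and several details are assumptions that the text does not fix: the estimate is initialized with the first measurement, the deviation starts at zero, the deviation is updated before RTT (so it uses the previous estimate), and both smoothing factors default to 7/8.

```python
class RttEstimator:
    """Smoothed round-trip time and deviation estimator (illustrative sketch)."""

    def __init__(self, first_sample: float, alpha: float = 7 / 8,
                 alpha_dev: float = 7 / 8):
        self.rtt = first_sample   # assumption: start with the first measurement
        self.dev = 0.0            # assumption: no deviation known initially
        self.alpha = alpha        # smoothing factor for RTT
        self.alpha_dev = alpha_dev  # smoothing factor for D, may differ from alpha

    def update(self, m: float) -> None:
        """Feed a new measured round-trip time M into both estimators."""
        # D = alpha * D + (1 - alpha) * |RTT - M|, using the previous RTT estimate
        self.dev = self.alpha_dev * self.dev + (1 - self.alpha_dev) * abs(self.rtt - m)
        # RTT = alpha * RTT + (1 - alpha) * M
        self.rtt = self.alpha * self.rtt + (1 - self.alpha) * m


est = RttEstimator(100.0)   # first measurement: 100 ms
est.update(180.0)           # a much slower measurement arrives
print(est.rtt, est.dev)     # 110.0 10.0
```

A single outlier of 180 ms moves the smoothed estimate only from 100 ms to 110 ms, which shows the damping effect of α = 7/8.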
With these estimations of the average round-trip time and the standard deviation, the retransmission timeout RTO is usually set as follows:

$RTO = RTT + 4 \cdot D$

The factor 4 was chosen more or less empirically. According to some studies, less than one percent of all packets come in more than four standard deviations late.

Karn's Algorithm

The dynamic estimation of the RTT has a problem if a timeout occurs and the segment is retransmitted. A subsequent acknowledgement might acknowledge the receipt of the first packet which contained that segment or any of the retransmissions. Guessing wrong can seriously impact the RTT estimation.

Karn therefore suggested that the RTT estimation is not updated for any segments which were retransmitted. Furthermore, Karn suggested that the RTO is doubled on each failure until the segment gets through, which leads to an exponential back-off for each consecutive attempt. These fixes are now known as Karn's algorithm.

4.4 Stream Control Transmission Protocol (SCTP)

The Stream Control Transmission Protocol (SCTP) is defined in RFC 2960 [51]. A shorter introduction can be found in RFC 3286 [52]. SCTP provides a reliable transport service, ensuring that data is transported without error and delivered in sequence. SCTP is message oriented and preserves the boundaries of application layer messages (application layer framing).

Data transfer between two SCTP hosts takes place in the context of an association. An association may contain multiple data streams and each stream has the property of independently sequenced delivery. A message loss in one stream thus does not affect the delivery in other streams. SCTP accomplishes multi-streaming by creating independence between data transmission and data delivery.

Unlike TCP, SCTP allows SCTP endpoints to have multiple IP addresses.
This multi-homing feature provides the benefit of potentially greater survivability of an SCTP association in the presence of network failures.

An SCTP packet is composed of a common header and chunks. A chunk contains either control information or user data.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Common Header                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Chunk #1                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              ...                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Chunk #n                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Multiple chunks can be bundled into one SCTP packet up to the MTU size. If a user data message doesn't fit into one SCTP packet it can be fragmented into multiple chunks. The SCTP common header has the following format:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Verification Tag                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Checksum                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Source Port field contains the port number used by the sending application layer process.
• The Destination Port field contains the port number used by the receiving application layer process.
• The Verification Tag field is used by the receiver to validate the sender of the SCTP packet. This field protects the SCTP protocol against certain attacks.
• The Checksum field contains a 32-bit CRC checksum as specified in RFC 3309 [53].

Payload is transmitted in so called data chunks.
Data chunks have the following format:

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |   Type = 0    |     Flags     |            Length             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |              Transmission Sequence Number (TSN)               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |     Stream Identifier (S)     |  Stream Sequence Number (n)   |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                  Payload Protocol Identifier                  |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    \                                                               \
    /                 User Data (seq n of Stream S)                 /
    \                                                               \
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Type field indicates the chunk type. Data chunks use the type number 0.
• The Flags field contains a set of binary flags:
  – U: The data chunk is unordered and there is no Stream Sequence Number assigned to the data chunk.
  – B: Indicates the beginning of a fragment of a user message.
  – E: Indicates the ending of a fragment (last fragment) of a user message.
• The Length field indicates the size of the chunk in bytes including the chunk header fields.
• The Transmission Sequence Number field contains the transmission sequence number which is also used by the receiver to reassemble messages.
• The Stream Identifier identifies the stream to which the following data belongs.
• The Stream Sequence Number identifies the stream sequence number of the following user data within the stream identified by the Stream Identifier.
• The Payload Protocol Identifier identifies the upper layer application protocol and is opaque from the viewpoint of an SCTP protocol engine.
• The User Data is of variable length and contains the actual payload.

4.5 Datagram Congestion Control Protocol (DCCP)

The Datagram Congestion Control Protocol (DCCP) is the latest addition to the set of Internet transport protocols.
It provides a congestion-controlled, unreliable flow of datagrams suitable for use by applications such as streaming media. DCCP is connection oriented and has a connection setup, data exchange and teardown phase. However, in contrast to TCP and SCTP, DCCP provides neither ordered delivery nor end-to-end flow control. DCCP supports path MTU discovery and explicit congestion notification (ECN).

The congestion control mechanism used for a DCCP connection can be negotiated. DCCP therefore supports different congestion control mechanisms which are identified by so-called congestion control identifiers (CCIDs). A TCP-like congestion control mechanism is identified by CCID 2. A TCP-Friendly Rate Control (TFRC) mechanism is identified by CCID 3.

DCCP uses nine different packet types, which are typically used in the following sequence:

1. A DCCP-Request is used by the client to initiate a connection with a server.
2. A DCCP-Response is sent by the server to indicate that it is willing to talk to a client.
3. A DCCP-Ack packet is sent by the client to acknowledge a DCCP-Response. Further DCCP-Ack packets might be exchanged to negotiate options.
4. After the connection has been established, DCCP-Data, DCCP-Ack and DCCP-DataAck packets are used to exchange payload data and acknowledgements.
5. The server sends a DCCP-CloseReq packet requesting to close the connection.
6. The client sends a DCCP-Close packet acknowledging the request to close the connection.
7. The server sends a DCCP-Reset packet to clear the connection state.

The connection teardown can also be started by the client. In this case, the client sends a DCCP-Close to close the connection. The server responds with a DCCP-Reset packet to clear the connection state.

All DCCP packets use a common header:

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |          Source Port          |           Dest Port           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | Type  | CCval |                Sequence Number                |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |  Data Offset  | # NDP | Cslen |           Checksum            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

• The Source Port field contains the port number used by the sending application layer process.
• The Destination Port field contains the port number used by the receiving application layer process.
• The Type field indicates the type of the DCCP message.
• The CCval field contains data which might be used by a congestion control mechanism.
• The Sequence Number field contains a sequence number which counts the number of packets.
• The Data Offset field contains the offset from the start of the DCCP header to the start of the payload, counted in 32-bit words.
• The NDP field contains the number of non-data packets sent in the sender's sequence, modulo 16.
• The Cslen field specifies which parts of the packet are covered by the Checksum field. The checksum always covers at least the DCCP header, DCCP options, and a pseudo header taken from the network-layer header.
• The Checksum field contains the Internet checksum computed over the pseudo header, the DCCP header, any DCCP options and, depending on the Cslen value, over some of the payload.

Chapter 5 Internet Application Layer

The Internet application layer has little internal structure. Many protocols run directly on top of the transport protocols and it is not unusual that the various protocols solve recurring problems (such as data encoding) in different ways.
Although this approach does not seem to be very efficient from a global perspective, it must be noted that the individual point solutions have been very successful in practice.

There are many application layer protocols in use in the Internet. This chapter focuses on those protocols which either realize certain core services (e.g., DNS), carry a significant portion of the traffic (e.g., HTTP, FTP, SMTP), or are interesting because of some of their special features.

5.1 Domain Name System (DNS)

The Internet uses bit sequences (IPv4/IPv6 addresses and port numbers) to address nodes and transport endpoints. These bit sequences can be processed efficiently by machines but they are hard for humans to remember. The Domain Name System (DNS) specified in RFC 1034 [54] and RFC 1035 [55] provides a global infrastructure to map human-friendly domain names to IP addresses and vice versa.

[Figure 5.1: Structure of Domain Name System (DNS) names. The tree descends from the virtual root to the top-level domains (nl, de, edu, org, net, com, biz), then to iu-bremen (2nd level), eecs (3rd level) and www (4th level).]

• The Domain Name System provides a hierarchical name space with a virtual root. The administration of the name space can be delegated along the paths starting from the virtual root.
• Name resolution is realized by so-called DNS servers. A DNS server knows a part (a zone) of the global name space and its position within the global name space. Note that some parts of the name space might be further delegated to other name servers.
• Name resolution queries can in principle be sent to arbitrary DNS servers. However, it is good practice to use a local DNS server as the primary DNS server. Non-local servers might be used as backup servers.
• Recursive queries cause the queried DNS server to contact other DNS servers as needed in order to obtain a response to the query. This is convenient for the DNS client but requires more complexity in the DNS server.
  The alternative is iterative queries, where the client may have to send a series of queries to several DNS servers to retrieve the desired information.
• The original DNS protocol does not provide sufficient security. In particular, there is no guarantee that the returned response is trustworthy. (Does a request to obtain an IP address for the name www.my-bank.com really return an IP address of my bank?) Furthermore, if the DNS were secure, it could be used as an infrastructure to distribute, for example, certificates used by other security mechanisms. Although there are standards for secure DNS (RFC 2535 [56]), they are not as widely used as they should be.
• Since well-known DNS names became a trade item in recent years, there is quite some political debate around the question of who is actually in charge of defining new top-level domain names. At the moment, this responsibility lies in the hands of the Internet Corporation for Assigned Names and Numbers (ICANN). However, ICANN itself is an organization that is not well accepted everywhere.

[Figure 5.2: Recursive name resolution with two DNS servers. On the client machine, an application calls the resolver (with its cache) through an API. The resolver uses DNS to contact the primary name server, which in turn uses DNS to contact the encompassing name server. Each of the two DNS server machines has a resource record database (RR DB) and a cache.]

The global Domain Name System basically consists of three components:

1. The hierarchical name space and the so-called Resource Records (RRs) that hold typed information for a given name. There are several standardized resource record types and new resource record types can be defined in order to store additional information in the domain name system.
2. A set of DNS servers which provide access to the information stored in resource records via the DNS protocol. DNS servers usually also maintain some information in local caches in order to increase the overall efficiency of the DNS system.
   DNS servers usually have authoritative information about one or multiple local zones for which they are responsible.
3. A set of resolver libraries, usually shipped as part of the operating system, which contain an implementation of a DNS client and provide a programmatic API to resolve names to addresses and vice versa (see getaddrinfo() and getnameinfo()). Application programs call the resolver system library functions to resolve DNS and potentially other names.

5.1.1 Format of Domain Names

The format of domain names is defined in RFC 1034 [54]. The following rules apply for traditional domain names:

• The names (labels) on a certain level of the tree must be unique and may not exceed 63 bytes in length. The character set for the labels is seven-bit ASCII. Comparisons are done in a case-insensitive manner.
• The labels must begin with a letter and end with a letter or decimal digit. The characters between the first and last character must be letters, digits or hyphens.
• The labels can be concatenated with dots to form paths within the name space. Absolute paths which end at the virtual root node end with a trailing dot. All other paths which do not end with a trailing dot are relative paths.
• The overall length of a domain name is limited to 255 bytes.

    Type    Description
    A       IPv4 address
    AAAA    IPv6 address
    CNAME   Alias for another name (canonical name)
    HINFO   Identification of the CPU and the operating system (host info)
    MX      List of mail servers (mail exchanger)
    NS      Identification of an authoritative server for a domain
    PTR     Pointer to another part of the name space
    SOA     Start and parameters of a zone (start of zone of authority)
    KEY     Public key associated with a name
    SIG     Signature over the RRs associated with a name

    Table 5.1: DNS resource record types

Recent efforts resulted in proposals for Internationalized Domain Names in Applications (IDNA) (RFC 3490 [57], RFC 3491 [58], RFC 3492 [59]).
The basic idea is to support internationalized character sets within applications. However, for backward compatibility reasons, internationalized character sets are encoded into seven-bit ASCII representations (ASCII Compatible Encoding, ACE). ACE labels are recognized by a so-called ACE prefix. The ACE prefix for IDNA is xn--. A label which contains an encoded internationalized name might for example be the value xn--de-jg4avhby1noc0d.

5.1.2 Resource Records

Information associated with a DNS name is stored in so-called resource records (RRs). Every resource record has the following attributes:

• The owner is the domain name which identifies a resource record.
• The type indicates the kind of information that is stored in a resource record. The most important types are defined in RFC 1034 [54], RFC 1886 [60], RFC 2535 [56], and RFC 3596 [61] and are summarized in Table 5.1.
• The class indicates the protocol-specific name space. The DNS protocol was originally designed to support other name spaces in addition to the Internet name space. The predominant class today is IN (Internet), while some older installations also supported the class CH (Chaos System).
• The time to live (TTL) defines how long (counted in seconds) information from a resource record can be stored in a local cache.
• The data format (RDATA) of a resource record depends on the type of the resource record.

5.1.3 DNS Message Formats

All DNS messages have the same structure. A DNS message has five parts:

1. A DNS message starts with a protocol header. It indicates which of the following four parts are present and whether the message is a query or a response.
2. The header is followed by a list of questions.
3. The list of questions is followed by a list of answers (resource records).
4. The list of answers is followed by a list of pointers to authorities (also in the form of resource records).
5. The list of pointers to authorities is followed by a list of additional information (also in the form of resource records). This list may contain, for example, A resource records for names in a response to an MX query.

The DNS protocol normally runs over UDP for simple queries. For larger data transfers (e.g., zone transfers), DNS may utilize TCP. Both protocols use the well-known port number 53.

Message Header

All DNS messages start with the common header which has the following format:

      0                             1
      0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      ID                       |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |QR|   Opcode  |AA|TC|RD|RA| Z|AD|CD|   RCODE   |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    QDCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    ANCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    NSCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    ARCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

• The ID field contains a number which allows incoming responses to be correlated with outstanding requests.
• The QR bit is 0 in a query message and 1 in a response message.
• The Opcode field indicates the query type. This field is 0 for a standard query (QUERY) and 1 for an inverse query (IQUERY).
• The AA bit is set if the response is authoritative (authoritative answer).
• The TC bit indicates that the message is truncated due to restrictions of the transport system (truncated).
• The RD bit is set on queries to start a recursive query (recursion desired).
• The RA bit is set on responses and denotes whether recursive query support is available (recursion available).
• The Z bit is unused.
• The AD bit indicates that all data in the response has been cryptographically verified or otherwise meets the DNS server's local security policy.
• The CD bit is set when a resolver is willing to accept non-authenticated data (checking disabled).
• The RCODE field contains an error code which is only significant in response messages.
• The QDCOUNT field contains the number of queries in the query list.
• The ANCOUNT field contains the number of responses in the answer list.
• The NSCOUNT field contains the number of authoritative name servers in the list of authoritative name servers.
• The ARCOUNT field contains the number of elements in the list of additional information.

Query Format

The elements in the list of queries have the following format:

      0                             1
      0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                                               |
    /                     QNAME                     /
    /                                               /
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                     QTYPE                     |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                     QCLASS                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

• The QNAME field contains the domain name which is being queried.
• The 16-bit field QTYPE determines the type of the query. Some defined values are 1 (A), 2 (NS), 5 (CNAME), 6 (SOA), 12 (PTR), 13 (HINFO), 15 (MX), 24 (SIG), 25 (KEY), and 28 (AAAA).
• The 16-bit field QCLASS determines the class. This field usually contains the value 1 for the Internet (IN).

Response Format

The list of answers, the list of authoritative servers and the list with additional information all have the same structure:

      0                             1
      0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                                               |
    /                                               /
    /                      NAME                     /
    |                                               |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      TYPE                     |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                     CLASS                     |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      TTL                      |
    |                                               |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                   RDLENGTH                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--|
    /                     RDATA                     /
    /                                               /
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

• The NAME field contains the domain name associated with the following resource record.
• The 16-bit TYPE field indicates the type of the resource record and determines the format of the RDATA field.
• The 16-bit CLASS field indicates the class. This field usually contains the value 1 for the Internet (IN).
• The 32-bit TTL field contains the lifetime for the following resource record in seconds.
• The 16-bit RDLENGTH field contains the length of the following RDATA field in bytes.
• The RDATA field contains the actual data in the form of a resource record. The format of the RDATA field depends on the resource record type indicated by the TYPE field.

Resource Record Formats

The various resource records have the following formats:

• An A resource record contains an IPv4 address encoded in 4 bytes in network byte order.
• An AAAA resource record contains an IPv6 address encoded in 16 bytes in network byte order.
• A CNAME resource record contains a character string preceded by the length of the string which is encoded in the first byte (and thus restricts the string to 255 characters).
• A HINFO resource record contains two character strings, each prefixed with a length byte. The first character string describes the CPU and the second string the operating system.
• An MX resource record contains a 16-bit preference number (used to prioritize multiple entries) followed by a character string prefixed with a length byte. The character string contains the DNS name of a mail exchanger.
• An NS resource record contains a character string prefixed by a length byte which contains the name of an authoritative DNS server.
• A PTR resource record contains a character string prefixed with a length byte which contains a domain name pointing to another part of the name space. PTR records are used to map IP addresses to names (so-called reverse lookups). For an IPv4 address of the form d1.d2.d3.d4, a PTR resource record is created for the pseudo domain name d4.d3.d2.d1.in-addr.arpa. For an IPv6 address of the form h1h2h3h4:...:h29h30h31h32, where each h is one hexadecimal digit, a PTR resource record is created for the pseudo domain name h32.h31.h30.h29. ... .h4.h3.h2.h1.ip6.arpa.
• An SOA resource record contains two character strings, each prefixed by a length byte, and five 32-bit numbers. The first character string contains the name of the DNS server responsible for a zone. The second character string contains the mail address of the administrator responsible for the management of the zone. The first unsigned 32-bit number contains a serial number (SERIAL) which must be incremented by the zone administrator whenever changes are made to the zone database. The second 32-bit number defines the time which may elapse before cached zone information must be updated (REFRESH). The third 32-bit number defines the time to wait before a failed refresh is retried (RETRY). The fourth 32-bit number defines a time interval after which zone information is no longer considered current (EXPIRE). The fifth 32-bit number contains the minimum lifetime for resource records (MINIMUM).
• The KEY and SIG resource records have a rather complex format which is described in detail in RFC 2535 [56].

5.2 Abstract Syntax Notation One (ASN.1)

The Abstract Syntax Notation One (ASN.1) [62, 63, 64] is a language for the definition of data structures and message formats which was developed in the 1980s. ASN.1 has been standardized by the ITU. Note that current versions of ASN.1 differ significantly from earlier versions of ASN.1.

[Figure 5.3: Abstract syntax, local syntax and transfer syntax. Systems A and B each map the abstract syntax (ASN.1) to their own local syntax. Their encoders/decoders exchange data using the transfer syntax (ASN.1 encoding rules).]

ASN.1 was primarily developed to formally describe data structures that are exchanged between applications in a distributed system. The idea was to let developers focus on the definition of data structures and to give developers tools to generate the necessary encoding and decoding functions. Some more specific requirements for ASN.1 were:

• Exchange of information between machines with different hardware architectures.
• Independence of existing programming languages (neutrality).
• Sender and receiver should be able to choose one out of multiple data encoding formats that suits their needs.

The fundamental principle behind ASN.1 is the separation of the data representation during transmission from the data representation within applications that might be written in different programming languages (see Figure 5.3).

• The abstract syntax defines data structures in a format that is neutral with respect to application implementations. The abstract syntax is mapped to a local syntax which is used in a concrete implementation. The local syntax is specific to the programming language being used to develop the application and might be different for different implementations. ASN.1 compilers can be used to generate the local syntax from the abstract syntax.
• The transfer syntax defines how data structures are serialized for transmission over a network. A concrete transfer syntax is commonly defined as a set of encoding rules which define how values of the various ASN.1 types are encoded. There are multiple encoding rules for ASN.1 and it is therefore necessary that applications agree on the transfer syntax to use. The implementation of the encoding and decoding functions can be automated by using ASN.1 compilers.

5.2.1 Basic Concepts

ASN.1 definitions basically consist of type definitions and value definitions that are organized into ASN.1 modules.

• Names of ASN.1 data types always begin with an uppercase character.
• Names of ASN.1 values (constants) always begin with a lowercase character.
• ASN.1 keywords and macro names only contain uppercase characters.
• Comments begin with two hyphens (--) and they end either at the end of the line or at the next occurrence of two hyphens (--).

5.2.2 ISO Registration Tree

It is often necessary to uniquely identify artefacts such as specific protocol definitions or parameters. The ISO registration tree provides a hierarchical name system that can be used for this purpose.
Note that the hierarchical structure makes it possible to delegate authority. For example, the US Department of Defense (dod) has delegated authority over the internet subtree to the Internet Assigned Numbers Authority (IANA).

[Figure 5.4: ISO registration tree. The root has the children itu-t(0), iso(1) and joint-iso-itu-t(2). The figure follows the path iso(1) identified-organization(3) dod(6) internet(1), whose children are directory(1), mgmt(2), experimental(3), private(4), security(5) and snmpV2(6), and shows parts of the mib-2(1) subtree below mgmt(2).]

Note that nodes are uniquely identified by the assigned numbers and not necessarily by their associated descriptors.

5.2.3 Primitive ASN.1 Data Types

• The data type BOOLEAN represents the two logical values TRUE and FALSE.
• The data type INTEGER represents integral numbers. Note that there is no restriction on the precision. The INTEGER type can also be used to represent named numbers.
• The data type BIT STRING represents a sequence of defined bits. The length of a BIT STRING does not have to be a multiple of 8.
• The data type OCTET STRING represents a sequence of octets (bytes). The OCTET STRING is a base type for character strings using different character sets or some other formatted strings such as GeneralizedTime or UTCTime.
• The data type ENUMERATED represents an enumeration of values. The set of possible values must be defined when deriving a type from the ENUMERATED primitive type.
• The data type OBJECT IDENTIFIER identifies a node in the ISO registration tree. An OBJECT IDENTIFIER value determines the path from the root of the registration tree to a target node.
• The data type ObjectDescriptor contains a character string identifying a node in the ISO registration tree.
  Note that the value of an ObjectDescriptor is not guaranteed to uniquely identify a node in the ISO registration tree.
• The data type ANY represents any valid ASN.1 data type. It is basically the union of all ASN.1 data types.
• The data type EXTERNAL represents data structures which have not been defined with ASN.1. This type is useful to incorporate non-ASN.1 elements in ASN.1 messages.
• The data type NULL is a placeholder, typically used to indicate that a value is missing in a constructed type.

5.2.4 Constructed ASN.1 Data Types

• The data type SEQUENCE corresponds to structs in C or records in Modula. The order of the elements of a SEQUENCE is well defined.
• The data type SET is similar to a SEQUENCE. However, the elements of a SET are not ordered.
• The data type SEQUENCE OF represents an ordered set (list) of values (which have the same type).
• The data type SET OF represents an unordered set (list) of values (which have the same type).
• The data type CHOICE is a selection type and corresponds to unions in C or variant records in Modula.
• The data type REAL is a constructed type representing floating point numbers. The mantissa and the exponent are INTEGER values. The REAL type therefore also has unlimited precision.

5.2.5 Restrictions on ASN.1 Data Types

ASN.1 data types can be restricted to reduce the number of possible values. The precise syntax for the restrictions depends on the data type. Typical restrictions are size restrictions and value range restrictions. Restrictions are inherited by derived types. It is usually a good idea to introduce restricted INTEGER types that match the precision supported by typical processor hardware.
Some example type restrictions are shown below:

    Unsigned32  ::= INTEGER (0..4294967295)
    Integer32   ::= INTEGER (-2147483648..2147483647)
    Unsigned64  ::= INTEGER (0..18446744073709551615)
    MacAddress  ::= OCTET STRING (SIZE (6))
    InetAddress ::= OCTET STRING (SIZE (4|16))

5.2.6 ASN.1 Tags

Data types have associated tags which can be used to identify the type of a value during transmission. A tag consists of a tag number and a tag class. There are four different tag classes:

1. UNIVERSAL: Globally unique identification (tag assignments restricted to ASN.1 standards).
2. APPLICATION: Unique identification within an application.
3. PRIVATE: Private and not universally used identification.
4. CONTEXT-SPECIFIC: Unique identification within a certain context (for example within a CHOICE).

    Data type                 Tag (decimal)
    BOOLEAN                    1
    INTEGER                    2
    BIT STRING                 3
    OCTET STRING               4
    NULL                       5
    OBJECT IDENTIFIER          6
    ObjectDescriptor           7
    EXTERNAL                   8
    REAL                       9
    ENUMERATED                10
    ...                       ...
    SEQUENCE, SEQUENCE OF     16
    SET, SET OF               17
    NumericString             18
    PrintableString           19
    TeletexString             20
    VideotexString            21
    IA5String                 22
    UTCTime                   23
    GeneralizedTime           24
    GraphicString             25
    VisibleString             26
    GeneralString             27
    CharacterString           28
    ...                       ...

    Table 5.2: Universal tag numbers for the core ASN.1 data types

Table 5.2 summarizes the universal tags for the fundamental ASN.1 data types which are assigned in the ASN.1 standard.

5.2.7 Example ASN.1 Definition

The following example shows an ASN.1 module for a definition of the Simple Network Management Protocol version 1. The definitions are semantically identical to RFC 1155 [65] and RFC 1157 [66] but differ in some details of the notation.

    SNMP-VERSION-1 DEFINITIONS ::= BEGIN

    -- object names and types

    ObjectName ::= OBJECT IDENTIFIER

    ObjectSyntax ::= CHOICE {
        simple              SimpleSyntax,
        application-wide    ApplicationSyntax
    }

    SimpleSyntax ::= CHOICE {
        integer-value       INTEGER (-2147483648..2147483647),
        string-value        OCTET STRING (SIZE (0..65535)),
        oid-value           OBJECT IDENTIFIER,
        empty               NULL
    }

    ApplicationSyntax ::= CHOICE {
        address-value       NetworkAddress,
        counter-value       Counter32,
        gauge-value         Gauge32,
        timeticks-value     TimeTicks,
        arbitrary-value     Opaque
    }

    NetworkAddress ::= CHOICE {
        internet            IpAddress
    }

    IpAddress ::= [APPLICATION 0] IMPLICIT OCTET STRING (SIZE (4))
    Counter32 ::= [APPLICATION 1] IMPLICIT INTEGER (0..4294967295)
    Gauge32   ::= [APPLICATION 2] IMPLICIT INTEGER (0..4294967295)
    TimeTicks ::= [APPLICATION 3] IMPLICIT INTEGER (0..4294967295)
    Opaque    ::= [APPLICATION 4] IMPLICIT OCTET STRING

    -- protocol data units

    Message ::= SEQUENCE {
        version             INTEGER { version-1(0) },
        community           OCTET STRING,
        data                ANY     -- PDUs if trivial authentication
    }

    PDUs ::= CHOICE {
        get-request         GetRequest-PDU,
        get-next-request    GetNextRequest-PDU,
        get-response        GetResponse-PDU,
        set-request         SetRequest-PDU,
        trap                Trap-PDU
    }

    GetRequest-PDU     ::= [0] IMPLICIT PDU
    GetNextRequest-PDU ::= [1] IMPLICIT PDU
    GetResponse-PDU    ::= [2] IMPLICIT PDU
    SetRequest-PDU     ::= [3] IMPLICIT PDU
    Trap-PDU           ::= [4] IMPLICIT TrapPDU

    max-bindings INTEGER ::= 2147483647

    RequestID ::= INTEGER (-2147483648..2147483647)

    ErrorIndex ::= INTEGER (0..max-bindings)

    ErrorStatus ::= INTEGER {
        noError(0), tooBig(1), noSuchName(2),
        badValue(3), readOnly(4), genErr(5)
    }

    VarBind ::= SEQUENCE {
        name                ObjectName,
        value               ObjectSyntax
    }

    VarBindList ::= SEQUENCE (SIZE (0..max-bindings)) OF VarBind

    PDU ::= SEQUENCE {
        request-id          RequestID,
        error-status        ErrorStatus,
        error-index         ErrorIndex,
        variable-bindings   VarBindList
    }

    TrapPDU ::= SEQUENCE {
        enterprise          OBJECT IDENTIFIER,
        agent-addr          NetworkAddress,
        generic-trap        INTEGER {
                                coldStart(0), warmStart(1), linkDown(2),
                                linkUp(3), authenticationFailure(4),
                                egpNeighborLoss(5), enterpriseSpecific(6)
                            },
        specific-trap       INTEGER (0..2147483647),
        time-stamp          TimeTicks,
        variable-bindings   VarBindList
    }

    END

5.2.8 Basic Encoding Rules (BER)

The Basic Encoding Rules (BER) use a tag/length/value (TLV) encoding. Every data element is encoded as a tag value, the length of the data element in bytes and the data element itself. The TLV encoding enables a receiver to identify the type of every data element, which can then be verified against the type the receiver expects. The price for this is a somewhat increased length of the encoded messages, and it is ultimately an engineering decision whether the type information justifies the price. There are other ASN.1 encodings like the Packed Encoding Rules (PER) which do not generally encode type information.

Encoding of Tags

[Figure 5.5: BER encoding of tags. In the first byte, bits 8 and 7 carry the tag class (universal 00, application 01, context 10, private 11), bit 6 carries the tag type (primitive 0, constructed 1), and bits 5 to 1 carry the tag number if it is less than 31. For larger tag numbers, bits 5 to 1 are all set to 1 and the tag number follows in subsequent bytes, each with bit 8 set to 1 except the last one, where bit 8 is 0.]

• If the tag number is less than 31, the tag number can be encoded in a single byte. This covers most cases in practice.
  If the tag number is larger, then multiple bytes must be used to encode the tag number. The highest bit of each byte indicates whether more bytes follow and hence only seven bits per byte are used to actually encode the number.
• The primitive / constructed bit can be used by the receiver to determine whether the value itself is a BER encoding, which is the case for constructed types. This allows the decoder to simply call the BER decoder recursively whenever a constructed tag has been received.

Encoding of Lengths

[Figure 5.6: BER encoding of the value length. Short form: bit 8 of the single length byte is 0 and bits 7 to 1 hold the length (< 128). Long form: bit 8 of the first byte is 1, bits 7 to 1 hold the number of bytes encoding the length, and the length (> 127) follows in these bytes.]

• A length of less than 128 bytes can be encoded in a single length byte. Larger length values must be encoded in multiple bytes where the first byte indicates how many bytes are used to encode the length value.
• The maximum length therefore is $2^{127 \cdot 8} - 1 = 2^{1016} - 1$ bytes.
• Besides the definite form length encoding shown here, there is also an indefinite form where the end of a value is identified by a certain marker.

Encoding of BOOLEAN Values

• A BOOLEAN value is encoded by a single byte. The value TRUE is represented by the byte 0xff and the value FALSE is represented by the byte 0x00.

Encoding of INTEGER Values

• An INTEGER value is encoded as a binary number. Negative numbers are encoded in two's complement notation.
• Note that it might be necessary to encode a leading 0x00 byte for unsigned values, since the highest bit would otherwise be interpreted as the sign. For example, given the following ASN.1 type definition

      Unsigned8 ::= INTEGER (0..255)

  the decimal value 255 will be encoded using two bytes as 0x00ff.

Encoding of BIT STRING Values

[Figure 5.7: BER encoding of BIT STRING values. The first byte holds the number of unused bits in the last byte. The following bytes hold the bits b0, b1, b2, ..., with b0 stored in bit 8 of the first of these bytes.]

• A BIT STRING value is encoded in a sequence of bytes that contain the named bits.
The byte sequence is prefixed with a byte which indicates the number of unused bits in the last byte of the byte sequence.

Encoding of OCTET STRING Values

• An OCTET STRING value is encoded as a sequence of bytes.

Encoding of NULL Values

• A NULL value is encoded using zero bytes. A NULL value is thus encoded by its tag and the length 0.

Encoding of OBJECT IDENTIFIER Values

    Sub-identifier < 128:   one byte, bit 8 is 0, bits 7-1 carry the sub-identifier
    Sub-identifier > 127:   several bytes with seven bits of the sub-identifier each,
                            bit 8 is 1 in every byte except the last one

    Figure 5.8: BER encoding of OBJECT IDENTIFIER values

• An OBJECT IDENTIFIER value is encoded as a sequence of sub-identifiers.
• The first two sub-identifiers X and Y of an OBJECT IDENTIFIER value are encoded together in a single sub-identifier using the formula (X · 40) + Y. (This works since X can only hold one of the three values 0, 1 or 2.)

Encoding of SEQUENCE / SEQUENCE OF / SET / SET OF Values

• Values of constructed types are encoded by encoding the elements contained in the constructed type.

Encoding of CHOICE Values

• CHOICE values are encoded by encoding the value currently present in the CHOICE. This requires that every alternative of the choice can be identified by its tag value.

Example BER Encoding

The following example shows the BER encoding of an SNMP message which is consistent with the ASN.1 definition shown in Section 5.2.7.

    Bytes                     Tag                       Length  Value
    30:1b                     SEQUENCE {                    27
    02:01:00                    INTEGER                      1  0
    04:06:70:75:62:6c:69:63     OCTET STRING                 6  "public"
    a1:0e                       GetNextRequest-PDU {        14
    02:04:36:a2:8f:07             INTEGER                    4  916623111
    02:01:00                      INTEGER                    1  0
    02:01:00                      INTEGER                    1  0
    30:00                         SEQUENCE OF {}             0
                                }
                              }

Remarks

• With definite length encoding, the length of the BER encoding of a value must be known before the length field can be encoded. The alternative would be to reserve some space for the length and to move encoded values around if the reserved space turns out to be insufficient (or too big). This approach, however, is rather costly.
• An alternative approach is to create the BER encoding from the innermost ASN.1 element to the outermost ASN.1 element and to construct the BER encoding from the end to the beginning. This technique works fine in some special cases but not in the general case.
• During the processing of messages, it is often required to change some fields in a message before it is passed on. This is difficult to achieve in the general case, since a small change can cause massive changes in the BER encoding, for example because length fields have to be enlarged.

5.2.9 Generic String Encoding Rules (GSER)

The Generic String Encoding Rules (GSER), defined in RFC 3641 [67], produce a human-readable UTF-8 character string encoding of the values of any ASN.1 type. The encoding does not use tags. Instead, values are prefixed with an identifier which is taken from the ASN.1 definition. The values contained in constructed types are usually enclosed in curly braces.

5.3 Simple Network Management Protocol (SNMP)

The Simple Network Management Protocol (SNMP), developed in the late 1980s, was designed to provide standardized access to management and control information on network devices. The introduction of SNMP was motivated by the need to monitor, control and configure the evolving Internet.

5.3.1 Foundations

It is necessary to first introduce some fundamental concepts and the associated terminology that is frequently used in the network management community.

Functional Areas

The services provided by a network management system can be grouped into five categories:

1. Fault Management
   Covers fault detection, fault isolation and fault correction.
2. Configuration Management
   Covers the creation and maintenance of configuration information, name management, and the starting, controlling and terminating of services.
3. Account Management
   Covers the collection of usage data, the distribution and monitoring of quotas, and the maintenance of usage statistics.
4. Performance Management
   Covers the collection of statistical data, the determination of system performance, and any changes made to optimize performance.
5. Security Management
   Covers the creation and control of security services, key generation and distribution, and the reporting and analysis of security-relevant events.

5.4 Augmented Backus-Naur Form (ABNF)

Many application layer protocols in the Internet protocol suite (especially those running over a byte-stream oriented transport) use a textual format for the protocol messages instead of binary encodings. Many of these protocols define a set of textual commands that can be sent to a server, which responds with textual replies that typically include a structured response code in a textual decimal format.

In order to reduce ambiguities when defining such textual protocols, an augmented version of the Backus-Naur Form (BNF) has been developed. The Augmented Backus-Naur Form (ABNF) defined in RFC 2234 [68] differs from standard BNF in several aspects such as naming rules, repetition, alternatives, order-independence, and value ranges.

5.4.1 Rule Names, Comments and Terminal Symbols

An ABNF definition consists of a set of rules (sometimes also called productions) like any other BNF. Every rule has a name, followed by an assignment operator, followed by an expression consisting of terminal symbols and operators.

    name = expression

The end of a rule is marked by the end of the line or by a comment. Comments start with the comment symbol ; (semicolon) and continue to the end of the line.

• The name of a rule must start with an alphabetic character followed by a combination of alphabetics, digits and hyphens. The case of a rule name is not significant.
• Terminal symbols are non-negative numbers.
The base of these numbers can be binary (b), decimal (d) or hexadecimal (x). Multiple values can be concatenated by using the dot . as a value concatenation operator. It is also possible to define ranges of consecutive values by using the hyphen - as a value range operator.
• Terminal symbols can also be defined by using literal text strings containing US-ASCII characters enclosed in double quotes. Note that these literal text strings are case-insensitive.

Some examples to demonstrate these core ABNF concepts:

    CR    = %d13            ; ASCII carriage return code in decimal
    CRLF  = %d13.10         ; ASCII carriage return and linefeed code sequence
    DIGIT = %x30-39         ; ASCII digits (0 - 9)
    ABA   = "aba"           ; ASCII string "aba" or "ABA" or "Aba" or ...
    abba  = %x61.62.62.61   ; ASCII string "abba"

Note that ABNF defines neither a module concept nor an import/export mechanism which could be used to import rules from other modules into the current module. This reflects the fact that ABNF is mostly used for documentation purposes rather than code generation purposes.

5.4.2 Operators

The right-hand side expressions of ABNF rules use a set of different operators. The ABNF operators are briefly described in the text below.

Concatenation

The simplest operator is the concatenation operator. The operator symbol is the empty word. (Note that white space characters are not significant in ABNF.) The rule

    abba = %x61 %x62 %x62 %x61   ; ASCII string "abba"

is therefore equivalent to the rule

    abba = %x61.62.62.61         ; ASCII string "abba"

and the terminal symbol concatenation feature is thus just a shortcut to make definitions more compact.

Alternatives

Elements separated by the alternatives operator / (forward slash) are alternatives. A rule of the form

    mumble = foo / bar

accepts the elements defined either by the rule foo or by the rule bar.
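The terminal notations and the alternatives operator described above can be mimicked with regular expressions. The following Python sketch is only an illustration under simplifying assumptions: the helper names terminal_to_regex and alternatives are made up for this example, and only single terminals of the forms shown above are handled, not complete ABNF rules.

```python
import re

# Base letters of ABNF numeric terminals and the radix they select.
BASES = {"b": 2, "d": 10, "x": 16}


def terminal_to_regex(terminal):
    """Translate one ABNF terminal symbol into a regular expression (sketch)."""
    if len(terminal) >= 2 and terminal[0] == '"' and terminal[-1] == '"':
        # Quoted literal text strings are case-insensitive.
        return "(?i:" + re.escape(terminal[1:-1]) + ")"
    if len(terminal) < 3 or terminal[0] != "%" or terminal[1].lower() not in BASES:
        raise ValueError(f"not an ABNF terminal: {terminal!r}")
    base = BASES[terminal[1].lower()]
    body = terminal[2:]
    if "-" in body:
        # Value range operator, e.g. %x30-39.
        low, high = (int(value, base) for value in body.split("-"))
        return "[" + re.escape(chr(low)) + "-" + re.escape(chr(high)) + "]"
    # Value concatenation operator, e.g. %d13.10 (a single value also ends up here).
    return "".join(re.escape(chr(int(value, base))) for value in body.split("."))


def alternatives(*terminals):
    """Combine terminals with the alternatives operator '/' (hypothetical helper)."""
    return "(?:" + "|".join(terminal_to_regex(t) for t in terminals) + ")"


digit = re.compile(terminal_to_regex("%x30-39"))
aba = re.compile(terminal_to_regex('"aba"'))
abba = re.compile(terminal_to_regex("%x61.62.62.61"))
mumble = re.compile(alternatives('"foo"', '"bar"'))

print(bool(digit.fullmatch("7")))      # True
print(bool(aba.fullmatch("ABA")))      # True: quoted strings ignore case
print(bool(abba.fullmatch("ABBA")))    # False: numeric values match exactly
print(bool(mumble.fullmatch("bar")))   # True
```

The contrast between the second and third result shows why a protocol specification has to fall back to numeric terminals whenever the case of a string matters.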
Note that the alternatives operator can be used to emulate the value range operator by spelling out all numbers in the range:

    DIGIT = "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"

Since it is often required to specify long lists of alternatives (and since ABNF is line oriented), there is an incremental version of the alternatives operator which combines the alternatives operator with the assignment symbol:

    DIGIT =  "0" / "1" / "2" / "3"
    DIGIT =/ "4" / "5" / "6" / "7"
    DIGIT =/ "8" / "9"

Grouping

An expression can be grouped using parentheses. A grouped expression is treated as a single element, regardless of the internal structure of the group's expression. It is recommended to use groups instead of relying on operator precedence whenever there is a good chance of misreading a rule.

Repetitions

Repetitions are specified using the parameterized repetition operator * (star). The full format of the operator is n*m, written directly in front of the repeated element, where n and m are optional decimal values. The value n indicates the minimum number of repetitions (defaults to 0 if not present) and the value m indicates the maximum number of repetitions (defaults to infinity if not present). There are two notations for special cases that appear frequently enough to justify the introduction of a special notation:

• The notation nfoo (for example 2foo) indicates that foo appears exactly n times. This is equivalent to n*nfoo.
• Optional elements can be written in square brackets as [foo], which is equivalent to *1foo.

Note that abba can now be written as follows:

    abba = %x61 2%x62 %x61

Operator Precedence

The precedence of the operators, from highest (binding tightest) at the top to lowest (binding loosest) at the bottom, is:

    Repetition
    Grouping, Optional
    Concatenation
    Alternative

It is generally recommended to use the grouping operator to make groups explicit in order to avoid any potential confusion.

5.4.3 Core Definitions

The following core definitions are taken from the ABNF specification in RFC 2234 [68]. They are frequently used in other ABNF definitions.

    ALPHA  = %x41-5A / %x61-7A   ; A-Z / a-z
    BIT    = "0" / "1"
    CHAR   = %x01-7F             ; any 7-bit US-ASCII character, excluding NUL
    CR     = %x0D                ; carriage return
    CRLF   = CR LF               ; Internet standard newline
    CTL    = %x00-1F / %x7F      ; controls
    DIGIT  = %x30-39             ; 0-9
    DQUOTE = %x22                ; " (Double Quote)
    HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
    HTAB   = %x09                ; horizontal tab
    LF     = %x0A                ; linefeed
    LWSP   = *(WSP / CRLF WSP)   ; linear white space (past newline)
    OCTET  = %x00-FF             ; 8 bits of data
    SP     = %x20                ; space
    VCHAR  = %x21-7E             ; visible (printing) characters
    WSP    = SP / HTAB           ; White space

5.4.4 ABNF in ABNF

The syntax of ABNF can be specified in ABNF itself. This definition is again copied from RFC 2234 [68].

    rulelist      = 1*( rule / (*c-wsp c-nl) )
    rule          = rulename defined-as elements c-nl
                        ; continues if next line starts with white space
    rulename      = ALPHA *(ALPHA / DIGIT / "-")
    defined-as    = *c-wsp ("=" / "=/") *c-wsp
                        ; basic rules definition and incremental alternatives
    elements      = alternation *c-wsp
    c-wsp         = WSP / (c-nl WSP)
    c-nl          = comment / CRLF
                        ; comment or newline
    comment       = ";" *(WSP / VCHAR) CRLF
    alternation   = concatenation *(*c-wsp "/" *c-wsp concatenation)
    concatenation = repetition *(1*c-wsp repetition)
    repetition    = [repeat] element
    repeat        = 1*DIGIT / (*DIGIT "*" *DIGIT)
    element       = rulename / group / option / char-val / num-val / prose-val
    group         = "(" *c-wsp alternation *c-wsp ")"
    option        = "[" *c-wsp alternation *c-wsp "]"
    char-val      = DQUOTE *(%x20-21 / %x23-7E) DQUOTE
                        ; quoted string of SP and VCHAR without DQUOTE
    num-val       = "%" (bin-val / dec-val / hex-val)
    bin-val       = "b" 1*BIT [ 1*("." 1*BIT) / ("-" 1*BIT) ]
                        ; series of concatenated bit values
                        ; or single ONEOF range
    dec-val       = "d" 1*DIGIT [ 1*("." 1*DIGIT) / ("-" 1*DIGIT) ]
    hex-val       = "x" 1*HEXDIG [ 1*("." 1*HEXDIG) / ("-" 1*HEXDIG) ]
    prose-val     = "<" *(%x20-3D / %x3F-7E) ">"
                        ; bracketed string of SP and VCHAR without angles
                        ; prose description, to be used as last resort

5.5 Simple Mail Transfer Protocol (SMTP)

A very fundamental service in the Internet is electronic mail (email). The Simple Mail Transfer Protocol (SMTP), originally specified in RFC 821 [69] and currently documented in RFC 2821 [70], is the primary protocol for mail transport and delivery in the Internet. SMTP usually runs over TCP and uses the well-known port number 25.

5.5.1 Foundations

Electronic mail is forwarded according to the store-and-forward principle: at any point in time, one node holds the current copy of a message and is responsible for forwarding it. To forward a message, a new copy is first created on the next node. Once this step has completed successfully, the local copy can be destroyed, since the next node is now responsible for the delivery. Figure 5.9 shows a typical configuration for sending, relaying and reading electronic mail in the Internet.

    Figure 5.9: Typical configurations for electronic mail (sender host systems with user agent, local MTA and mail queue; relay MTAs with mail queues in organization A and organization B; receiver host systems with local MTA, user mailbox and user agent; SMTP is used between the MTAs, IMAP for access to a user mailbox)

• The mail user agent (MUA) conducts the dialogue with the user and accepts a new electronic message. The message is subsequently handed over either to the local MTA or to a relay MTA (see below).
• A mail transfer agent (MTA) is responsible for forwarding electronic messages through the Internet. A relay MTA is an intermediate system whose sole purpose is to forward messages within the Internet.
A gateway MTA is an intermediate system that forwards messages into other networks (e.g., ISO/OSI networks based on the X.400 protocol).
• Once it has arrived at the destination system, electronic mail is stored in a user-specific mailbox. The mailbox is accessed either through the file system or by means of special protocols such as IMAP (see Section 5.6).
• In analogy to conventional postal mail, the transfer of electronic mail distinguishes between an envelope, which matters for the forwarding and delivery of a message, a header, which describes general parameters of a message, and the actual content (body).
• The content (body) is initially not structured any further and is restricted to 7-bit US-ASCII characters. Additional standards describe how the content can be structured and, in particular, how multi-part documents with different document types can be realized (MIME).

5.5.2 Commands and Replies

Electronic mail is transferred by means of the Simple Mail Transfer Protocol (SMTP) [70]. The protocol defines a relatively small set of commands which are invoked by an SMTP client and executed by the SMTP server. The execution is acknowledged by numeric result codes.
    HELO   Identification of an SMTP client to an SMTP server (HELLO)
    EHLO   Extended identification with announcement of the supported SMTP options (EXTENDED HELLO)
    MAIL   Start of a message transfer (MAIL)
    RCPT   Identification of a recipient (RECIPIENT)
    DATA   Start of the transfer of the content (DATA)
    RSET   Abort of a transaction in progress (RESET)
    VRFY   Verification of an address (VERIFY)
    EXPN   Expansion of a mailing list (EXPAND)
    HELP   Help on the available commands (HELP)
    NOOP   Command without effect (NOOP)
    QUIT   Termination of the transport connection (QUIT)

The exact syntax of the SMTP commands is defined in ABNF:

    ;
    ; Syntax of the SMTP commands:
    ;
    helo = "HELO" SP Domain CRLF
    ehlo = "EHLO" SP Domain CRLF
    mail = "MAIL FROM:" ("<>" / Reverse-Path) [SP Mail-Parameters] CRLF
    rcpt = "RCPT TO:" ("<Postmaster@" domain ">" / "<Postmaster>" /
           Forward-Path) [SP Rcpt-Parameters] CRLF
    data = "DATA" CRLF
    rset = "RSET" CRLF
    vrfy = "VRFY" SP String CRLF
    expn = "EXPN" SP String CRLF
    help = "HELP" [ SP String ] CRLF
    noop = "NOOP" [ SP String ] CRLF
    quit = "QUIT" CRLF

    ;
    ; Syntax of the SMTP command parameters:
    ;
    Reverse-path    = Path
    Forward-path    = Path
    Path            = "<" [ A-d-l ":" ] Mailbox ">"
    A-d-l           = At-domain *( "," A-d-l )
                          ; Note that this form, the so-called "source route",
                          ; MUST BE accepted, SHOULD NOT be generated, and SHOULD be
                          ; ignored.
    At-domain       = "@" domain
    Mail-parameters = esmtp-param *(SP esmtp-param)
    Rcpt-parameters = esmtp-param *(SP esmtp-param)
    esmtp-param     = esmtp-keyword ["=" esmtp-value]
    esmtp-keyword   = (ALPHA / DIGIT) *(ALPHA / DIGIT / "-")
    esmtp-value     = 1*(%d33-60 / %d62-127)
                          ; any CHAR excluding "=", SP, and control characters
    Keyword         = Ldh-str
    Argument        = Atom
    Domain          = (sub-domain 1*("." sub-domain)) / address-literal
    sub-domain      = Let-dig [Ldh-str]
    address-literal = "[" IPv4-address-literal /
                          IPv6-address-literal /
                          General-address-literal "]"
    Mailbox         = Local-part "@" Domain
    Local-part      = Dot-string / Quoted-string
    Dot-string      = Atom *("." Atom)
    Atom            = 1*atext
    Quoted-string   = DQUOTE *qcontent DQUOTE
    String          = Atom / Quoted-string

    ;
    ; Syntax for IPv4/IPv6 addresses:
    ;
    IPv4-address-literal    = Snum 3("." Snum)
    IPv6-address-literal    = "IPv6:" IPv6-addr
    General-address-literal = Standardized-tag ":" 1*dcontent
    Standardized-tag        = Ldh-str
                                  ; MUST be specified in a standards-track RFC
                                  ; and registered with IANA
    Snum        = 1*3DIGIT   ; representing a decimal integer
                             ; value in the range 0 through 255
    Let-dig     = ALPHA / DIGIT
    Ldh-str     = *( ALPHA / DIGIT / "-" ) Let-dig
    IPv6-addr   = IPv6-full / IPv6-comp / IPv6v4-full / IPv6v4-comp
    IPv6-hex    = 1*4HEXDIG
    IPv6-full   = IPv6-hex 7(":" IPv6-hex)
    IPv6-comp   = [IPv6-hex *5(":" IPv6-hex)] "::" [IPv6-hex *5(":" IPv6-hex)]
                      ; The "::" represents at least 2 16-bit groups of zeros
                      ; No more than 6 groups in addition to the "::" may be
                      ; present
    IPv6v4-full = IPv6-hex 5(":" IPv6-hex) ":" IPv4-address-literal
    IPv6v4-comp = [IPv6-hex *3(":" IPv6-hex)] "::"
                  [IPv6-hex *3(":" IPv6-hex) ":"] IPv4-address-literal
                      ; The "::" represents at least 2 16-bit groups of zeros
                      ; No more than 4 groups in addition to the "::" and
                      ; IPv4-address-literal may be present

In response to the commands, the server sends three-digit numeric reply codes with additional textual explanations, which are, however, not significant for the protocol. The reply codes follow a fixed pattern (theory of reply codes).

• The first digit of the three-digit reply code indicates the kind of reply:

    1yz   Preliminary positive reply; further information is required before the action is carried out (positive preliminary reply).
    2yz   Final positive reply reporting the successful execution of an action (positive completion reply).
    3yz   Intermediate positive reply; further information is required to complete the action (positive intermediate reply).
    4yz   Transient negative reply; the error is of a temporary nature and the command can be repeated (transient negative completion reply).
    5yz   Final negative reply; automatically repeating the command is not useful (permanent negative completion reply).

• The second digit groups the replies into specific categories:

    x0z   Syntax problems.
    x1z   Informational replies and status information.
    x2z   Replies that refer to the transmission channel.
    x3z   Not defined.
    x4z   Not defined.
    x5z   Status of the server in the context of the initiated actions.

• The third digit gives the exact meaning of the reply within the respective category.

Because the reply codes are structured in these three levels, an implementation can react in a reasonably sensible way even to new reply codes it does not know. The reply codes defined in the SMTP protocol are listed below, ordered by function:

    500   Syntax error, command unrecognized
    501   Syntax error in parameters or arguments
    502   Command not implemented (see section 4.2.4 of RFC 2821)
    503   Bad sequence of commands
    504   Command parameter not implemented

    211   System status, or system help reply
    214   Help message

    220   <domain> Service ready
    221   <domain> Service closing transmission channel
    421   <domain> Service not available, closing transmission channel

    250   Requested mail action okay, completed
    251   User not local; will forward to <forward-path>
    252   Cannot VRFY user, but will accept message and attempt delivery
    450   Requested mail action not taken: mailbox unavailable
    550   Requested action not taken: mailbox unavailable
    451   Requested action aborted: error in processing
    551   User not local; please try <forward-path>
    452   Requested action not taken: insufficient system storage
    552   Requested mail action aborted: exceeded storage allocation
    553   Requested action not taken: mailbox name not allowed
    354   Start mail input; end with <CRLF>.<CRLF>
    554   Transaction failed (Or, in the case of a connection-opening response, "No SMTP service here")

5.5.3 Message Headers

The format of the message headers is defined in RFC 2822 [?]. The essential production of the ABNF looks as follows:

    fields         = *(trace *resent-field) *regular-field

    resent-field   = resent-date / resent-from / resent-sender
    resent-field  =/ resent-to / resent-cc / resent-bcc / resent-msg-id

    regular-field  = orig-date / from / sender / reply-to / to / cc / bcc
    regular-field =/ message-id / in-reply-to / references / subject
    regular-field =/ comments / keywords

• The resent-field elements are generated by MTAs during the transfer of messages. They serve for troubleshooting and for the detection of loops.
• The remaining fields have meanings that are presumably common knowledge by now.

5.5.4 Multipurpose Internet Mail Extensions (MIME)

The Multipurpose Internet Mail Extensions (MIME) [?] are conventions which allow the content of a message to be structured and which, in particular, also make it possible to transfer different document types (including binary formats).

Additional Message Header Fields

The MIME specifications define several new fields for the message header:

• The field Mime-Version defines the MIME version number (currently 1.0).
• The field Content-Type describes which media type the message contains. There are composite media types in which every contained document describes its own Content-Type.
• The field Content-Transfer-Encoding defines how the data is encoded during the transfer. This is necessary because classic SMTP only supports 7-bit ASCII, and in order to avoid ambiguities.
• The optional field Content-Description contains a short description. It is particularly useful when it is not guaranteed that a recipient is able to render the document type.

Media Types

MIME distinguishes five primitive media types:

1. The media type text can be used for arbitrary text. The simplest subtype is plain. For the media type text, the character set (charset) can be specified via a parameter.
2. The media type image can be used for arbitrary image formats. Typical subtypes are jpeg or png.
3. The media type audio stands for a document that requires an audio channel for its output.
4. The media type video stands for sequences of moving image data. A typical subtype is mpeg.
5. The media type application stands for data formats that are understood by specific applications. Typical representatives are postscript or pdf.

Besides the primitive media types, there are also composite media types:

1. The media type multipart is used for documents that consist of parts, each of which may have a different media type.
2. The media type message can itself contain a message, although this contained message must not itself be of the type message.

In composite content, the individual parts are separated from each other by delimiter lines, each of which begins with two hyphens (--) at the start of a line. In addition, the parameter boundary of the field Content-Type must define a delimiter string that has to follow the two hyphens (--). The last delimiter line is additionally terminated by two hyphens (--).
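The delimiter rules just described can be sketched in a few lines of Python. This is a simplified illustration, not a complete MIME implementation: the function name build_multipart and its (headers, text) input format are assumptions made for this example, and production code would normally rely on Python's standard email package instead.

```python
CRLF = "\r\n"


def build_multipart(parts, boundary):
    """Join body parts with MIME delimiter lines (simplified sketch).

    parts is a list of (headers, text) pairs, where headers is a possibly
    empty list of header lines belonging to that part.
    """
    delimiter = "--" + boundary           # two hyphens followed by the boundary
    lines = []
    for headers, text in parts:
        if any(line.startswith(delimiter) for line in text.split(CRLF)):
            # The boundary must not show up at the start of a content line.
            raise ValueError("boundary occurs inside a body part")
        lines.append(delimiter)
        lines.extend(headers)
        lines.append("")                  # empty line ends the part headers
        lines.append(text)
    lines.append(delimiter + "--")        # last delimiter: two more hyphens
    return CRLF.join(lines) + CRLF


boundary = "simple boundary"
content_type = f'Content-Type: multipart/mixed; boundary="{boundary}"'
body = build_multipart(
    [
        ([], "first part"),
        (["Content-Type: text/plain; charset=us-ascii"], "second part"),
    ],
    boundary,
)
print(content_type)
print(body)
```

The sketch rejects a part whose content contains the delimiter at the start of a line, since a receiver could otherwise not tell where that part ends.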
    From: Nathaniel Borenstein <nsb@bellcore.com>
    To: Ned Freed <ned@innosoft.com>
    Date: Sun, 21 Mar 1993 23:56:48 -0800 (PST)
    Subject: Sample message
    MIME-Version: 1.0
    Content-type: multipart/mixed; boundary="simple boundary"

    This is the preamble. It is to be ignored, though it
    is a handy place for composition agents to include an
    explanatory note to non-MIME conformant readers.

    --simple boundary

    This is implicitly typed plain US-ASCII text.
    It does NOT end with a linebreak.
    --simple boundary
    Content-type: text/plain; charset=us-ascii

    This is explicitly typed plain US-ASCII text.
    It DOES end with a linebreak.

    --simple boundary--

    This is the epilogue. It is also to be ignored.

Base64 Encoding

With this encoding, every group of three bytes (i.e., 24 bits) is represented by four characters taken from a 6-bit character set. The resulting character sequence is broken into lines in such a way that lines are never longer than 76 characters.

    Value Encoding    Value Encoding    Value Encoding    Value Encoding
        0 A              17 R              34 i              51 z
        1 B              18 S              35 j              52 0
        2 C              19 T              36 k              53 1
        3 D              20 U              37 l              54 2
        4 E              21 V              38 m              55 3
        5 F              22 W              39 n              56 4
        6 G              23 X              40 o              57 5
        7 H              24 Y              41 p              58 6
        8 I              25 Z              42 q              59 7
        9 J              26 a              43 r              60 8
       10 K              27 b              44 s              61 9
       11 L              28 c              45 t              62 +
       12 M              29 d              46 u              63 /
       13 N              30 e              47 v
       14 O              31 f              48 w           (pad) =
       15 P              32 g              49 x
       16 Q              33 h              50 y

If fewer than three bytes remain at the end of the input, padding bytes are appended and the special character = is added to the encoding in order to indicate the number of padding bytes used. Obviously, a text is no longer readable without further tools once it has been Base64 encoded.

Quoted-Printable Encoding

The Quoted-Printable encoding tries to preserve as much of the printable text as possible. Only characters that cannot be represented in US-ASCII are specially encoded.
The basic principle is to represent non-printable characters by their hexadecimal value, where the character = is used as the marker (e.g., =0A=0D). The exact rules are in fact relatively complex. See RFC 2045 [?] for the details.

5.6 Internet Message Access Protocol (IMAP)

The Internet Message Access Protocol (IMAP) [?] provides access to messages that are physically stored on a server. Several mailboxes (folders) can be managed, and offline operation with later synchronization is possible as well. (IMAP additionally provides access to Usenet News by treating news groups as further mailboxes.) In detail, the following functions are supported:

• Creating, deleting and renaming mailboxes.
• Checking for newly arrived electronic mail.
• Deleting electronic mail.
• Setting and clearing flags for electronic mail.
• Searching for specific electronic mail.
• Selective access to attributes and contents of parts of an electronic mail message.

IMAP usually runs over TCP and uses the port number 143 by default. Since authentication was originally performed by transmitting passwords in cleartext, IMAP is often used over TLS or SSL. Interactions between a client and a server are usually line oriented, where lines are terminated by a CRLF.

5.6.1 Identification and States

Messages are managed in several mailboxes. The mailboxes are identified by names, and a hierarchical name space is supported. The messages in a mailbox are identified by numbers:

• The message sequence number of a message is the relative position of the message in a mailbox (counting starts at 1).
The message sequence number of a message can change as a result of deletions and insertions.
• The unique identifier of a message identifies the message in a mailbox independently of the position of the message in the mailbox. The unique identifiers persist across multiple IMAP sessions and thus allow synchronization in offline operation.

Of course, problems can also occur when unique identifiers are used, in particular when external programs reorganize a mailbox or when entire mailboxes are deleted and new ones are created under the same name. For this reason, there is an additional global counter which is incremented when such events occur. It indicates that the current unique identifiers no longer have to be identical to old stored unique identifiers.

5.6.2 States

The IMAP protocol distinguishes several states. Depending on the current state, different commands are available to a client.

    Figure 5.10: IMAP state diagram (states: connection established and server greeting, non-authenticated, authenticated, selected, logout and connection release)

• The initial state connection established and server greeting is entered after the transport connection has been established. In this state, the server sends a greeting message.
• Afterwards, a transition into the state non-authenticated normally takes place. In this state, essentially only those commands are permitted that are needed for authentication.
• After successful authentication, a transition into the state authenticated takes place. In this state, a client can essentially select a mailbox for further processing.
• In the state selected, the content of the selected mailbox can be accessed and changes can be made. The selected mailbox can be released again (transition into the state authenticated), or the IMAP session can be terminated.
• In the state logout and connection release, the session is terminated and the transport connection is closed in an orderly fashion.

5.6.3 Commands

Each IMAP command is permitted only in certain states. The following list of commands is therefore sorted by state:

Any state

    CAPABILITY    Returns a list of the capabilities of the server
    NOOP          Empty command (can be used for status updates)
    LOGOUT        Termination of the IMAP session

State non-authenticated

    AUTHENTICATE  Selection of an authentication mechanism
    LOGIN         Trivial authentication with a cleartext password

State authenticated

    SELECT        Selection of a mailbox (read-write)
    EXAMINE       Selection of a mailbox (read-only)
    CREATE        Creation of a new mailbox
    DELETE        Deletion of a mailbox
    RENAME        Renaming of a mailbox
    SUBSCRIBE     Addition of a mailbox to the list of active mailboxes
    UNSUBSCRIBE   Removal of a mailbox from the list of active mailboxes
    LIST          Listing of the names of the mailboxes
    LSUB          Listing of the names of the active mailboxes
    STATUS        Status query for a mailbox
    APPEND        Appending of data to a mailbox

State selected

    CHECK         Creation of a checkpoint (backup copy) of the mailbox
    CLOSE         Closing of the current mailbox
    EXPUNGE       Deletion of all messages that are marked for deletion
    SEARCH        Search for messages that satisfy certain criteria
    FETCH         Reading of data of a message from the mailbox
    STORE         Modification of data of a message in the mailbox
    COPY          Copying of messages to the end of a mailbox
    UID           Execution of COPY, FETCH, STORE or SEARCH with unique identifiers instead of message sequence numbers

5.6.4 Tagging

IMAP supports concurrent operations on the server. A client can thus issue several commands which are then executed asynchronously by the server. A client therefore labels its commands with unique tags in order to be able to match replies to the different commands later on. The syntax of a command is thus:

    command = tag SPACE (command_any / command_auth /
              command_nonauth / command_select) CRLF

The server replies to commands with responses. Three kinds of response lines are distinguished:

1. Response lines which request further information needed to process a command are marked with a +.
2. Response lines which contain a status message and do not imply the end of the processing of a command are marked with a *.
3. All other responses start with the tag chosen by the client and indicate the (successful or unsuccessful) completion of the processing of a command.

Success or failure is reported by keywords (OK, NO, BAD) and not, as in SMTP, by structured reply codes.

5.6.5 Message Format

The IMAP message format itself is again described in ABNF. This section reproduces an excerpt of the ABNF definitions for the most basic elements. For the rather extensive complete version, the reader is referred to RFC 2060 [?].
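Before turning to the grammar, the tagging scheme and the three kinds of response lines from Section 5.6.4 can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions: the names tag_generator and classify, the tag format a001, a002, ... and the sample response lines are all made up for this example, and no real server is contacted.

```python
import itertools


def tag_generator(prefix="a"):
    """Yield a fresh tag for every command: a001, a002, ... (assumed format)."""
    for number in itertools.count(1):
        yield f"{prefix}{number:03d}"


def classify(line, pending):
    """Classify one response line of the server.

    pending is the set of tags of commands that have not completed yet.
    Returns a (kind, tag, status) triple; tag and status are None unless
    the line completes a command.
    """
    if line.startswith("+"):
        return ("continuation", None, None)   # server requests more information
    if line.startswith("*"):
        return ("untagged", None, None)       # status data, command not finished
    tag, _, rest = line.partition(" ")
    status = rest.split(" ", 1)[0].upper()
    if tag in pending and status in ("OK", "NO", "BAD"):
        pending.discard(tag)                  # this command is now complete
        return ("completion", tag, status)
    raise ValueError(f"unexpected response line: {line!r}")


tags = tag_generator()
pending = set()
commands = []
for text in ("CAPABILITY", "NOOP"):
    tag = next(tags)
    pending.add(tag)
    commands.append(f"{tag} {text}\r\n")

# Hypothetical server output; completions may arrive in a different order.
responses = [
    "* CAPABILITY IMAP4rev1",
    "a002 OK NOOP completed",
    "a001 OK CAPABILITY completed",
]
for line in responses:
    print(classify(line, pending))
```

Because every completion line carries the tag of its command, the client can match the two completions correctly even though they arrive in reverse order. The grammar excerpt expresses the same distinction through the productions continue_req, response_data and response_tagged.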
  tag             = 1*<any ATOM_CHAR except "+">

  ;
  ; top-level productions for IMAP commands:
  ;

  command         = tag SPACE (command_any / command_auth /
                    command_nonauth / command_select) CRLF
                    ;; Modal based on state

  command_any     = "CAPABILITY" / "LOGOUT" / "NOOP" / x_command
                    ;; Valid in all states

  command_auth    = append / create / delete / examine / list / lsub /
                    rename / select / status / subscribe / unsubscribe
                    ;; Valid only in Authenticated or Selected state

  command_nonauth = login / authenticate
                    ;; Valid only when in Non-Authenticated state

  command_select  = "CHECK" / "CLOSE" / "EXPUNGE" /
                    copy / fetch / store / uid / search
                    ;; Valid only when in Selected state

  ;
  ; top-level productions for IMAP responses:
  ;

  response        = *(continue_req / response_data) response_done

  continue_req    = "+" SPACE (resp_text / base64)

  response_data   = "*" SPACE (resp_cond_state / resp_cond_bye /
                    mailbox_data / message_data / capability_data) CRLF

  response_done   = response_tagged / response_fatal

  response_fatal  = "*" SPACE resp_cond_bye CRLF
                    ;; Server closes connection immediately

  response_tagged = tag SPACE resp_cond_state CRLF

5.7 File Transfer Protocol (FTP)

The File Transfer Protocol (FTP) defined in RFC 959 [71] is one of the oldest and most elementary protocols of the Internet. It can be used to transfer files between nodes connected to the Internet. The protocol uses a separate TCP connection for each data transfer. This approach sidesteps the problem of marking the end of a file. On the other hand, using separate TCP connections is not very efficient for small files, due to the connection establishment/tear-down overhead and the fact that flow control and congestion state has to be built up for each new connection.
[Figure 5.11: FTP interaction model. The FTP client and the FTP server each consist of a control process and a data transfer process with access to the local file system; the client additionally has a user interface. Commands and replies are exchanged between the control processes, data between the data transfer processes.]

Remarks:

• The control connection uses a text-based line-oriented protocol which is similar to SMTP. The client sends commands which are processed by the server. The server sends responses using three-digit response codes.

• A separate TCP connection is established for each data transfer. The connection can be initiated either by the client or by the server. If the server initiates the data connection (active mode), it connects from its well-known data port 20 to a port on the client, whose number must first be conveyed to the server (PORT command). If the client initiates the data connection (passive mode), the server first announces the port number on which it is listening (PASV command). The well-known port number for the control connection is 21.

• FTP allows a client to resume a data transmission that did not complete, using a special restart mechanism.

• FTP can be used to initiate a data transfer between two remote systems. However, this feature of FTP can result in some interesting security problems. For more details, consult RFC 2577 [72].

5.8 Hypertext Transfer Protocol (HTTP)

The Hypertext Transfer Protocol (HTTP) is defined in RFC 2616 [73] and is one of the core building blocks of the World Wide Web. HTTP is a simple request/response protocol primarily used to exchange documents between clients (browsers) and servers. HTTP runs on top of TCP and uses the well-known port number 80.

HTTP utilizes MIME conventions in order to distinguish different media types. Documents are identified using Uniform Resource Identifiers (URIs) as defined in RFC 2396 [74]. HTTP provides a fixed set of methods that can be applied to documents identified by a URI. The current version of HTTP supports the methods shown in Table 5.3.
  Method    Description
  OPTIONS   Request information about the communication options available
  GET       Retrieve whatever information is identified by a URI
  HEAD      Retrieve only the meta-information which is identified by a URI
  POST      Annotate an existing resource or pass data to a data-handling process
  PUT       Store information under the supplied URI
  DELETE    Delete the resource identified by the URI
  TRACE     Application-layer loopback of request messages for testing purposes
  CONNECT   Initiate a tunnel such as a TLS or SSL tunnel

Table 5.3: HTTP 1.1 methods

The principal structure of HTTP messages is described in the following simplified ABNF. A Message is either a Request or a Response. Syntactically, they differ only in the first line, which is either a Request-Line or a Status-Line.

  Message          = Request / Response

  Request          = Request-Line
                     *(message-header CRLF)
                     CRLF
                     [ message-body ]

  Response         = Status-Line
                     *(message-header CRLF)
                     CRLF
                     [ message-body ]

  Request-Line     = Method SP Request-URI SP HTTP-Version CRLF
  Status-Line      = HTTP-Version SP Status-Code SP Reason-Phrase CRLF

  Method           =  "OPTIONS" / "GET" / "HEAD" / "POST"
  Method           =/ "PUT" / "DELETE" / "TRACE" / "CONNECT"
  Method           =/ Extension-Method
  Extension-Method = token

  Request-URI      = "*" / absoluteURI / abs_path / authority
  HTTP-Version     = "HTTP" "/" 1*DIGIT "." 1*DIGIT
  Status-Code      = 3DIGIT
  Reason-Phrase    = *<TEXT, excluding CR, LF>

  message-header   = field-name ":" [ field-value ]
  field-name       = token
  field-value      = *( field-content / LWS )
  field-content    = <the OCTETs making up the field-value
                     and consisting of either *TEXT or combinations
                     of token, separators, and quoted-string>

5.8.1 Persistent Connections and Pipelining

HTTP 1.1 supports persistent connections. A client can establish a connection to a server and use it to send multiple Request messages. Earlier versions of HTTP allowed only a single Request/Response exchange over a single connection, which is of course rather expensive.
To use a connection for multiple requests, it is important to detect the end of a message body (document). HTTP relies on the Content-Length header field for this purpose. If for some reason the server does not know the length before starting to send the response (typically the case for dynamic pages that are constructed on the fly), then the server may choose to close the connection to indicate the end of the message body.

HTTP 1.1 also allows clients to send multiple requests without waiting for each response (pipelining), which can significantly reduce latency. Web pages typically consist of an HTML document which has links to many small icons and other elements. Retrieving all these referenced elements over a single connection in a pipelined mode significantly reduces the number of TCP round-trip message exchanges.

5.8.2 Caching and Proxies

Probably the most interesting and also most complex part of HTTP is its support for proxies and caching. Proxies are entities that sit between the client and the server and basically relay requests and responses. The HTTP 1.1 specification describes how proxies are supposed to handle requests, and it defines message headers that can be used by a client to learn about the proxies between the client and the server or to control how many proxies may exist in the path between the client and the server.

Some proxies and clients also maintain caches where copies of documents are stored in local storage space to speed up future accesses to these cached documents. HTTP allows a client to interrogate the server to determine whether the document has changed or not.

Caching is a key to the efficient operation of the Web. HTTP allows servers to control whether and how a page can be cached as well as its lifetime. Furthermore, browsers can force a request to bypass caches and obtain a fresh copy from a server. Note that not all problems related to HTTP proxies and caches have been solved.
A good list of issues can be found in RFC 3143 [75].

5.8.3 Negotiation

HTTP supports header fields that allow clients and servers to negotiate capabilities and preferences. There are two different mechanisms:

1. Server-driven negotiation begins with a request from a client (a browser). The client indicates a list of its preferences. The server then decides how to best respond to the request.

2. Client-driven negotiation requires two requests. The client first asks the server what is available and then decides which concrete request to send to the server.

Negotiation can be used to select different document formats, different transfer encodings, different languages or different character sets. In most cases, server-driven negotiation is used since it needs only a single request/response exchange and is therefore much more efficient.

The following is an example of typical negotiation header lines:

  Accept: text/xml, application/xml, text/html;q=0.9, text/plain;q=0.8
  Accept-Language: de, en;q=0.5
  Accept-Encoding: gzip, deflate, compress;q=0.9
  Accept-Charset: ISO-8859-1, utf-8;q=0.66, *;q=0.33

The last line says that the client prefers the ISO-8859-1 encoding (with the default preference 1), accepts the UTF-8 encoding with preference 0.66 and any other encoding with preference 0.33.

5.8.4 Conditional Requests

HTTP allows a client to make a request conditional by including headers that qualify the conditions under which the request should be honored. Conditional requests can be used to avoid unnecessary transfers. This can best be explained by looking at a concrete example. Suppose a client has cached a certain document and wants to check whether the document in the cache needs to be updated. The client can now send a request which includes the following header line:

  If-Modified-Since: Wed, 26 Nov 2003 22:21:08 GMT

The server checks whether the document was changed after the date indicated by the header line and only processes the request if this is the case. Otherwise, it answers with the status code 304 (Not Modified) and omits the message body.
5.8.5 Delta Encoding

Delta encodings are defined in RFC 3229 [76] as a mechanism to request only the changes relative to a specific version. The motivation behind this proposal is the observation that many documents only change slightly over time and that it is often much more efficient to just retrieve the changes instead of retrieving the whole document.

5.8.6 HTTP as a Substrate

HTTP has been so successful that it is often used as a substrate for other application protocols. Examples are the Internet Printing Protocol (IPP) or the Simple Object Access Protocol (SOAP). However, HTTP was not designed for this purpose and there are some pitfalls. For those interested in the details, consult RFC 3205 [77].

Appendix A

Packet Capturing

Networks sometimes do not function as expected and in some situations it is useful to capture the relevant frames / packets for analysis. It is also often useful to count certain frames / packets in order to obtain usage statistics. There are special programs such as tcpdump, ethereal or ngrep which can be used to capture and analyze packets / frames.

A performance critical aspect is the interface between the network interface (hardware), the operating system kernel and the user space programs. To achieve good performance, all the components have to play well together:

• Captured packets should be filtered or aggregated as early as possible.

• The copying of packets is an expensive operation and should be avoided wherever possible.

• The number of system calls and the associated context switches should be minimized.

An approach to address this problem is to discard unimportant frames / packets as early as possible, ideally before they are copied from the device driver's memory to the operating system kernel memory.

A.1 BSD Packet Filter (BPF)

The BSD Packet Filter (BPF) [78] is based on a simple but powerful control-flow machine which is implemented within the Unix kernel.
It allows for an effective filtering of frames / packets without expensive copying operations. The BPF is programmed by user space applications which transform human readable filter expressions into a sequence of BPF instructions and install them into the kernel. Packets which match the BPF filter are first collected within the kernel and moved to the user space application in batches in order to reduce the number and costs of the required system calls.

The BPF machine has the following components:

• An accumulator for all calculations.

• An index register (x) which allows access to data relative to a certain position.

• Memory for storing intermediate results.

• All registers and memory locations are 32 bits wide.

The instruction set of the BPF machine uses a fixed format which can be interpreted efficiently. Below are some example BPF programs. For more details, see [78].

Example 1 Select all Ethernet frames which contain IPv4 packets (ip):

  (000) ldh  [12]                   ; load ethernet type field
  (001) jeq  #0x800   jt 2   jf 3   ; compare with 0x800
  (002) ret  #96                    ; return snaplen
  (003) ret  #0                     ; filter failed
[Figure A.1: BPF within the Unix kernel. Labels: network, link-level drivers, BPF with one filter and one buffer per capturing application, protocol stack, kernel, user.]

Example 2 Select all Ethernet frames which contain IPv4 packets that do not originate from the networks 128.3.112/24 and 128.3.254/24 (ip and not src net 128.3.112/24 and not src net 128.3.254/24):

  (000) ldh  [12]                        ; load ethernet type field
  (001) jeq  #0x800       jt 2   jf 7    ; compare with 0x800
  (002) ld   [26]                        ; load ipv4 address field
  (003) and  #0xffffff00                 ; mask the network part
  (004) jeq  #0x80037000  jt 7   jf 5    ; compare with 128.3.112.0
  (005) jeq  #0x8003fe00  jt 7   jf 6    ; compare with 128.3.254.0
  (006) ret  #96                         ; return snaplen
  (007) ret  #0                          ; filter failed

A.2 libpcap

The libpcap C library (http://www.tcpdump.org/) provides the following functionality:

• A portable API that hides the differences of packet filter implementations in different operating systems.

• A compiler which translates human readable filter expressions into BPF programs.

• An interpreter for BPF programs which can be used to filter (previously captured) packets in user space.

• Functions for writing captured packets to files and for reading previously captured packets from files.

The usage of the libpcap API can best be illustrated by an example program. The following C source code implements a program which opens a file containing previously captured packets, optionally installs a filter and then processes the filtered packets in a callback function.

A more detailed description of the libpcap API can be found in the pcap(3) manual page. The syntax of the supported filter expressions is documented in the tcpdump(1) manual page.

A.3 jpcap

Programmers who prefer the Java language can use the Java package jpcap (http://www.sf.net/projects/jpcap), which allows captured packets to be processed by Java programs. The jpcap package is implemented as a Java wrapper around the libpcap API.
The C program from the previous section can be written in Java as follows:

For a more detailed description of the jpcap API, see the jpcap Java documentation. Before using the jpcap API, please check whether Java is fast enough for processing data rates from high speed networks.

Appendix B

Sockets

The socket application programming interface (API) was developed at the University of California in Berkeley as part of the work on the BSD Unix system. The socket interface is a generic interface for interprocess communication using message passing. Sockets are abstract communication endpoints with a rather small number of associated function calls.

The socket API distinguishes different types of sockets:

• A stream socket (SOCK_STREAM) is a bidirectional reliable communication endpoint. Data written to a local stream socket can be read from a remote stream socket without having to worry about transmission errors, fragmentation or reordering that might occur in the underlying network. While the order of the byte stream is not changed, the data block boundaries are not preserved.

• A datagram socket (SOCK_DGRAM) is a bidirectional unreliable communication endpoint which allows the exchange of datagrams. Datagrams sent over the local datagram socket may not be received by the remote datagram socket or they may be received multiple times. Furthermore, the ordering of the datagrams can change during the transmission. Note that datagram boundaries are preserved.

• A raw socket (SOCK_RAW) is a communication endpoint which allows network or interface layer datagrams to be received and sent.

• A reliably delivered message socket (SOCK_RDM) is similar to a datagram socket but provides in addition reliable datagram delivery.

• A sequenced packet socket (SOCK_SEQPACKET) is similar to a stream socket but retains data block boundaries.
B.1 Socket Addresses

It is necessary to assign a name (an address) to a communication endpoint before it can be used. The socket API supports different name spaces with different address formats. The generic data structure for addresses (struct sockaddr) is defined as follows:

  #include <sys/socket.h>

  struct sockaddr {
      uint8_t     sa_len;         /* address length (BSD) */
      sa_family_t sa_family;      /* address family */
      char        sa_data[...];   /* data of some size */
  };

  struct sockaddr_storage {
      uint8_t     ss_len;         /* address length (BSD) */
      sa_family_t ss_family;      /* address family */
      char        padding[...];   /* padding of some size */
  };

Newer BSD systems support the sa_len field in the generic and the specific socket addresses, which was not present in the older socket API. Other systems usually do not have this sa_len member (although it is generally a good idea to have this member).

The currently most important name spaces are the name spaces for the Internet and a name space for local communication:

IPv4 Socket Addresses

Sockets that represent IPv4 communication endpoints use the address family AF_INET and the protocol family PF_INET. IPv4 transport addresses are represented by the structure struct sockaddr_in:

  #include <sys/socket.h>
  typedef ... sa_family_t;

  #include <netinet/in.h>
  typedef ... in_port_t;
  typedef ... in_addr_t;          /* unsigned 32-bit integer */

  struct in_addr {
      in_addr_t s_addr;           /* IPv4 address (network byte order) */
  };

  struct sockaddr_in {
      uint8_t        sin_len;     /* address length (BSD) */
      sa_family_t    sin_family;  /* address family */
      in_port_t      sin_port;    /* transport layer port */
      struct in_addr sin_addr;    /* IPv4 address */
  };

IPv6 Socket Addresses

Sockets that represent IPv6 communication endpoints use the address family AF_INET6 and the protocol family PF_INET6. IPv6 transport addresses are represented by the structure struct sockaddr_in6:

  #include <sys/socket.h>
  typedef ... sa_family_t;

  #include <netinet/in.h>
  typedef ... in_port_t;

  struct in6_addr {
      uint8_t s6_addr[16];              /* IPv6 address */
  };

  struct sockaddr_in6 {
      uint8_t         sin6_len;         /* address length (BSD) */
      sa_family_t     sin6_family;      /* address family */
      in_port_t       sin6_port;        /* transport layer port */
      uint32_t        sin6_flowinfo;    /* flow information */
      struct in6_addr sin6_addr;        /* IPv6 address */
      uint32_t        sin6_scope_id;    /* scope identifier */
  };

Local Socket Addresses

Sockets that represent local communication endpoints use the address family AF_LOCAL and the protocol family PF_LOCAL. Local sockets are also widely known under the name Unix sockets with the address family AF_UNIX and the protocol family PF_UNIX. Local socket addresses are represented by the structure struct sockaddr_un:

  #include <sys/socket.h>
  typedef ... sa_family_t;

  #include <sys/un.h>

  struct sockaddr_un {
      uint8_t     sun_len;        /* address length (BSD) */
      sa_family_t sun_family;     /* address family */
      char        sun_path[108];  /* path name (size not fixed by POSIX; 108 on Linux) */
  };

B.2 Communication Kinds

The socket API supports different kinds of communication in application programs. The two most important styles are connection-less datagram communication and connection-oriented data stream communication.

[Figure B.1: Connection-less datagram communication. Both sides call socket() and bind(), exchange data with sendto() and recvfrom(), and finally call close().]

Figure B.1 shows how a server and a client make use of the socket primitives to provide and realize a connection-less datagram application protocol. After creating and binding a local socket, the processes use the recvfrom() and the sendto() primitives to receive and send datagrams.

Figure B.2 shows how a server and a client make use of the socket primitives to provide and realize a connection-oriented application protocol. The server creates a listening local socket which is used to accept incoming connections. Once a connection has been accepted, a new local file descriptor is returned which can be used to read() or write() data.
The close() function is called to close the connection. On the client side, the connect() function is used to connect the local socket to a remote (server) socket. When the connect() function returns successfully, normal read() or write() functions can be used to exchange data. The close() function is again called to close the connection.

[Figure B.2: Connection-oriented data stream communication. The server calls socket(), bind(), listen() and accept(); the client calls socket() and connect() (connection setup). Data is then exchanged with read() and write(), and both sides call close() (connection release).]

B.3 Socket API Overview

The following definitions summarize the socket API. For a more detailed description, please consult the corresponding Unix manual pages.

  #include <sys/types.h>
  #include <sys/socket.h>
  #include <unistd.h>

  #define SOCK_STREAM     ...
  #define SOCK_DGRAM      ...
  #define SOCK_RAW        ...
  #define SOCK_RDM        ...
  #define SOCK_SEQPACKET  ...

  #define AF_LOCAL  ...
  #define AF_INET   ...
  #define AF_INET6  ...

  #define PF_LOCAL  ...
  #define PF_INET   ...
  #define PF_INET6  ...

  int socket(int domain, int type, int protocol);
  int bind(int socket, const struct sockaddr *addr, socklen_t addrlen);
  int connect(int socket, const struct sockaddr *addr, socklen_t addrlen);
  int listen(int socket, int backlog);
  int accept(int socket, struct sockaddr *addr, socklen_t *addrlen);

  ssize_t write(int socket, const void *buf, size_t count);
  ssize_t send(int socket, const void *msg, size_t len, int flags);
  ssize_t sendto(int socket, const void *msg, size_t len, int flags,
                 const struct sockaddr *addr, socklen_t addrlen);

  ssize_t read(int socket, void *buf, size_t count);
  ssize_t recv(int socket, void *buf, size_t len, int flags);
  ssize_t recvfrom(int socket, void *buf, size_t len, int flags,
                   struct sockaddr *addr, socklen_t *addrlen);

  int shutdown(int socket, int how);
  int close(int socket);

  int getsockopt(int socket, int level, int optname,
                 void *optval, socklen_t *optlen);
  int setsockopt(int socket, int level, int optname,
                 const void *optval, socklen_t optlen);

  int getsockname(int socket, struct sockaddr *addr, socklen_t *addrlen);
  int getpeername(int socket, struct sockaddr *addr, socklen_t *addrlen);

B.4 Name Resolution

Numeric addresses are usually hard for humans to memorize. It is thus useful to introduce more human friendly symbolic names. The Internet protocols use the Domain Name System (DNS) to map symbolic names to Internet addresses. Note that DNS supports IPv4 as well as IPv6 addresses. Furthermore, there is usually also a locally defined mapping of well-known port numbers to symbolic names (e.g., port number 80 has the well-known symbolic names http or www).

The name to address mapping is supported by the functions getaddrinfo() and getnameinfo() which are described below. Many older programs still use the functions gethostbyname() and gethostbyaddr() which have been deprecated.

  #include <sys/types.h>
  #include <sys/socket.h>
  #include <netdb.h>

  #define AI_PASSIVE      ...
  #define AI_CANONNAME    ...
  #define AI_NUMERICHOST  ...

  struct addrinfo {
      int              ai_flags;
      int              ai_family;
      int              ai_socktype;
      int              ai_protocol;
      socklen_t        ai_addrlen;
      struct sockaddr *ai_addr;
      char            *ai_canonname;
      struct addrinfo *ai_next;
  };

  int getaddrinfo(const char *node, const char *service,
                  const struct addrinfo *hints,
                  struct addrinfo **res);
  void freeaddrinfo(struct addrinfo *res);
  const char *gai_strerror(int errcode);

The mapping of names to addresses is realized by the function getaddrinfo(). This function has three input parameters (node, service, hints) and returns, via res, a pointer to a list of struct addrinfo elements. This list must be released by calling freeaddrinfo() once it is no longer used. In case of an error, getaddrinfo() returns a non-zero value which can be passed to gai_strerror() in order to get a human readable error description.

One of the arguments node and service can be NULL, thus requesting only a resolution of the other element. The name resolution process can be further controlled by passing some hints to the function. Hints can be used, for example, to request addresses of a certain address family or socket type.

  #include <sys/types.h>
  #include <sys/socket.h>
  #include <netdb.h>

  #define NI_NOFQDN        ...
  #define NI_NUMERICHOST   ...
  #define NI_NAMEREQD      ...
  #define NI_NUMERICSERV   ...
  #define NI_NUMERICSCOPE  ...
  #define NI_DGRAM         ...

  int getnameinfo(const struct sockaddr *sa, socklen_t salen,
                  char *host, socklen_t hostlen,
                  char *serv, socklen_t servlen,
                  int flags);
  const char *gai_strerror(int errcode);

The inverse mapping of addresses to symbolic names is supported by the function getnameinfo(). The first two parameters (sa, salen) are input parameters. The result of the mapping is a host name and a service name, which are written to the memory location host with the length hostlen and to the memory location serv with the length servlen. Additional flags can be passed to the mapping function in order to control the details of the mapping process.

Example 3 The following source code implements a client for a simple connection-oriented protocol which retrieves the date and time from a server.

Example 4 The following source code implements a server for a simple connection-oriented protocol which provides the date and time to clients.
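How the connection-oriented primitives and the name resolution functions fit together can be sketched with a program that plays both roles in one process over the loopback interface. This is a minimal sketch under stated assumptions, not the listing of the examples above: the function name daytime_roundtrip, the loopback address 127.0.0.1, the kernel-chosen port and the text format of the reply are all assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <unistd.h>

/* Runs a one-shot date/time server and a client in the same process.
 * Stores the text received by the client in buf and returns its length,
 * or -1 on error. */
static ssize_t daytime_roundtrip(char *buf, size_t buflen)
{
    struct addrinfo hints, *res = NULL;
    struct sockaddr_storage ss;
    socklen_t sslen = sizeof ss;
    char port[16], text[64];
    int lfd = -1, cfd = -1, sfd = -1;
    ssize_t n = -1;
    time_t now = time(NULL);

    if (buflen == 0) return -1;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST;

    /* Server: bind to the loopback address; port 0 lets the kernel choose. */
    if (getaddrinfo("127.0.0.1", "0", &hints, &res) != 0) return -1;
    lfd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (lfd < 0 || bind(lfd, res->ai_addr, res->ai_addrlen) < 0 ||
        listen(lfd, 1) < 0) goto done;

    /* Map the bound address back to a numeric service (port) string. */
    if (getsockname(lfd, (struct sockaddr *)&ss, &sslen) < 0 ||
        getnameinfo((struct sockaddr *)&ss, sslen, NULL, 0,
                    port, sizeof port, NI_NUMERICSERV) != 0) goto done;
    freeaddrinfo(res);
    res = NULL;

    /* Client: resolve the server address and connect to it. */
    if (getaddrinfo("127.0.0.1", port, &hints, &res) != 0) {
        res = NULL;
        goto done;
    }
    cfd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (cfd < 0 || connect(cfd, res->ai_addr, res->ai_addrlen) < 0) goto done;

    /* Server: accept the connection, send the time, release the connection. */
    sfd = accept(lfd, NULL, NULL);
    if (sfd < 0) goto done;
    strftime(text, sizeof text, "%Y-%m-%d %H:%M:%S UTC\n", gmtime(&now));
    if (write(sfd, text, strlen(text)) < 0) goto done;
    close(sfd);
    sfd = -1;

    /* Client: read the reply sent by the server. */
    n = read(cfd, buf, buflen - 1);
    if (n >= 0) buf[n] = '\0';

done:
    if (res != NULL) freeaddrinfo(res);
    if (sfd >= 0) close(sfd);
    if (cfd >= 0) close(cfd);
    if (lfd >= 0) close(lfd);
    return n;
}

int main(void)
{
    char buf[64];

    if (daytime_roundtrip(buf, sizeof buf) < 0) {
        fprintf(stderr, "round trip failed\n");
        return 1;
    }
    fputs(buf, stdout);
    return 0;
}
```

Running both roles in one process works here because the kernel completes the connection setup for a listening socket before accept() is called. A real server would call accept() in a loop and would usually run in a separate process.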
Example 5 The following source code implements a client for a simple connection-less protocol which retrieves the date and time from a server.

Example 6 The following source code implements a server for a simple connection-less protocol which provides the date and time to clients.

B.5 Multiplexing

The examples discussed so far all had the property that the server or the client could block (freeze) in case of some communication errors. For example, the connection-oriented server blocks incoming requests until the current client has been served. One approach to address this deficiency is to use threads, which of course requires thread-safe libraries. The other alternative is to avoid calling blocking functions by first checking whether a socket can be read or written.

  #include <sys/select.h>

  typedef ... fd_set;

  FD_ZERO(fd_set *set);
  FD_SET(int fd, fd_set *set);
  FD_CLR(int fd, fd_set *set);
  FD_ISSET(int fd, fd_set *set);

  int select(int n, fd_set *readfds, fd_set *writefds,
             fd_set *exceptfds, struct timeval *timeout);
  int pselect(int n, fd_set *readfds, fd_set *writefds,
              fd_set *exceptfds, const struct timespec *timeout,
              const sigset_t *sigmask);

The functions select() and pselect() can be used to test whether socket descriptors are "ready" so that a subsequent socket library call does not block. The select() call distinguishes three sets of socket descriptors:

1. The set readfds contains descriptors which will be watched to see if a subsequent read operation will not block.

2. The set writefds contains descriptors which will be watched to see if a subsequent write operation will not block.

3. The set exceptfds contains the descriptors which will be watched for exceptions.

The macros FD_ZERO(), FD_SET(), FD_CLR() and FD_ISSET() can be used to manipulate the sets. The timeout is an upper bound on the amount of time elapsed before the select function returns. The parameter n contains the highest-numbered file descriptor in any of the three sets plus 1.
The function pselect() allows the correct handling of situations where a program wants to wait for socket descriptors as well as software signals. Example 7 The following source code combines the connection-less server with the connection-oriented server. The main loop uses the select() function to wait for incoming requests. 122 APPENDIX B. SOCKETS Bibliography [1] F. Halsall. Data Communications, Computer Networks and Open Systems. Addison-Wesley, 4 edition, 1996. [2] A. S. Tanenbaum. Computer Networks. Prentice Hall, 4 edition, 2002. [3] W. Stallings. Data and Computer Communications. Prentice Hall, 6 edition, 2000. [4] D. E. Comer. Internetworking with TCP/IP: Principles, Protocols, and Architectures. Prentice Hall, 4 edition, 2000. [5] C. Huitema. Routing in the Internet. Prentice Hall, 2 edition, 1999. [6] W. R. Stevens. TCP/IP Illustrated, Volume 1: The Protocols. Addison Wesley, 1994. [7] J. F. Kurose and K. W. Ross. Computer Networking: A Top-Down Approach Featuring the Internet. AddisonWesley, 3 edition, 2004. [8] R. Hinden and S. Deering. Internet Protocol Version 6 (IPv6) Addressing Architecture. RFC 3513, Nokia, Cisco Systems, April 2003. [9] R. Elz. A Compact Representation of IPv6 Addresses. RFC 1924, University of Melbourne, April 1996. [10] D. Waitzman. A Standard for the Transmission of IP Datagrams on Avian Carriers. RFC 1149, BBN STC, April 1990. [11] P. Hoffman and S. Bradner. Defining the IETF. RFC 3233, Internet Mail Consortium, Harvard University, February 2002. [12] S. Bradner. The Internet Standards Process – Revision 3. RFC 2026, Harvard University, October 1996. [13] R. M. Metcalfe and D. R. Boggs. Ethernet: Distributed packet switching for local computer networks. Communications of the ACM, 19(5):395–404, July 1976. [14] ANSI/IEEE. Local Area Networks: CSMA/CD, Std 802.3, 1988. [15] B. Carpenter. Architectural Principles of the Internet. RFC 1958, IAB, June 1996. [16] R. Bush and D. Meyer. 
Some Internet Architectural Guidelines and Philosophy. RFC 3439, December 2002. [17] S. Deering and R. Hinden. Internet Protocol, Version 6 (IPv6) Specification. RFC 2460, Cisco, Nokia, December 1998. [18] J. Postel. Internet Protocol. RFC 791, ISI, September 1981. [19] IANA. Special-Use IPv4 Addresses. RFC 3330, Internet Assigned Numbers Authority, September 2002. [20] Y. Rekhter, B. Moskowitz, D. Karrenberg, G. J. deGroot, and E. Lear. Address Allocation for Private Internets. RFC 1918, Cisco Systems, Chrysler Corp., RIPE NCC, Silicon Graphics, Inc., February 1996. [21] K. Nichols, S. Blake, F. Baker, and D. Black. Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers. RFC 2474, Cisco Systems, Torrent Networking Technologies, EMC Corporation, December 1998. [22] K. Ramakrishnan, S. Floyd, and D. Black. The Addition of Explicit Congestion Notification (ECN) to IP. RFC 3168, TeraOptic Networks, ACIRI, EMC, September 2001. [23] D. Grossman. New Terminology and Clarifications for Diffserv. RFC 3260, Motorola, Inc., April 2002. [24] F. Baker. Requirements for IP Version 4 Routers. RFC 1812, Cisco Systems, June 1995. [25] G. Trotter. Terminology for Forwarding Information Base (FIB) based Router Performance. RFC 3222, Agilent Technologies, December 2001. 123 124 BIBLIOGRAPHY [26] M. A. Ruiz-Sánchez, E. W. Biersack, and W. Dabbous. Survey and Taxonomy of IP Address Lookup Algorithms. IEEE Network, pages 8–23, March 2000. [27] J. Postel. Internet Control Message Protocol. RFC 792, ISI, September 1981. [28] C. Kent and J. Mogul. Fragmentation Considered Harmful. In Proc. SIGCOMM ’87 Workshop on Frontiers in Computer Communications Technology, August 1987. [29] J. Mogul and S. Deering. Path MTU Discovery. RFC 1191, DECWRL, Stanford University, November 1990. [30] C. Hornig. A Standard for the Transmission of IP Datagrams over Ethernet Networks. RFC 894, Symbolics Cambridge Research Center, April 1984. [31] D. C. Plummer. 
An Ethernet Address Resolution Protocol. RFC 826, MIT, November 1982. [32] R. Finlayson, T. Mann, J. Mogul, and M. Theimer. A Reverse Address Resolution Protocol. RFC 903, Stanford University, June 1984. [33] R. Droms. Dynamic Host Configuration Protocol. RFC 2131, Bucknell University, March 1997. [34] S. Alexander and R. Droms. DHCP Options and BOOTP Vendor Extensions. RFC 2132, Silicon Graphics, Bucknell University, March 1997. [35] R. Droms and W. Arbaugh. Authentication for DHCP Messages. RFC 3118, Cisco Systems, University of Maryland, June 2001. [36] R. Gilligan and E. Nordmark. Transition Mechanisms for IPv6 Hosts and Routers. RFC 2893, FreeGate Corp., Sun Microsystems, August 2000. [37] S. Kent and R. Atkinson. IP Authentication Header. RFC 2402, BBN Corporation, At Home Network, November 1998. [38] S. Kent and R. Atkinson. IP Encapsulating Security Payload (ESP). RFC 2406, BBN Corporation, At Home Network, November 1998. [39] A. Conta and S. Deering. Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6) Specification. RFC 2463, Lucent, Cisco Systems, December 1998. [40] M. Crawford. Transmission of IPv6 Packets over Ethernet Networks. RFC 2464, Fermilab, December 1998. [41] T. Narten, E. Nordmark, and W. Simpson. Neighbor Discovery for IP Version 6 (IPv6). RFC 2461, IBM, Sun Microsystems, Daydreamer, December 1998. [42] G. Malkin. RIP Version 2. RFC 2453, Bay Networks, November 1998. [43] F. Baker and R. Atkinson. RIP-2 MD5 Authentication. RFC 2082, Cisco Systems, January 1997. [44] J. Moy. OSPF Version 2. RFC 2328, Ascend Communications, April 1998. [45] Y. Rekhter and T. Li. A Border Gateway Protocol 4 (BGP-4). RFC 1771, IBM, Cisco, March 1995. [46] G. Huston. The BGP Routing Table. The Internet Journal, 4(1), March 2001. [47] J. Postel. User Datagram Protocol. RFC 768, ISI, August 1980. [48] J. Postel. Transmission Control Protocol. RFC 793, ISI, September 1981. [49] M. Allman, V. Paxson, and W. Stevens. 
TCP Congestion Control. RFC 2581, NASA Glenn/Sterling Software, ACIRI/ICSI, April 1999.
[50] M. Allman, S. Floyd, and C. Partridge. Increasing TCP’s Initial Window. RFC 3390, BBN/NASA GRC, ICIR, BBN Technologies, October 2002.
[51] R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang, and V. Paxson. Stream Control Transmission Protocol. RFC 2960, Motorola, Cisco, Siemens, Nortel Networks, Ericsson, Telcordia, UCLA, ACIRI, October 2000.
[52] L. Ong and J. Yoakum. An Introduction to the Stream Control Transmission Protocol (SCTP). RFC 3286, Ciena Corporation, Nortel Networks, May 2002.
[53] J. Stone, R. Stewart, and D. Otis. Stream Control Transmission Protocol (SCTP) Checksum Change. RFC 3309, Stanford, Cisco Systems, SANlight, September 2002.
[54] P. Mockapetris. Domain Names - Concepts and Facilities. RFC 1034, ISI, November 1987.
[55] P. Mockapetris. Domain Names - Implementation and Specification. RFC 1035, ISI, November 1987.
[56] D. Eastlake. Domain Name System Security Extensions. RFC 2535, IBM, March 1999.
[57] P. Faltstrom, P. Hoffman, and A. Costello. Internationalizing Domain Names in Applications (IDNA). RFC 3490, Cisco, IMC & VPNC, UC Berkeley, March 2003.
[58] P. Hoffman and M. Blanchet. Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN). RFC 3491, IMC & VPNC, Viagenie, March 2003.
[59] A. Costello. Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA). RFC 3492, UC Berkeley, March 2003.
[60] S. Thomson and C. Huitema. DNS Extensions to support IP version 6. RFC 1886, Bellcore, INRIA, December 1995.
[61] S. Thomson, C. Huitema, V. Ksinant, and M. Souissi. DNS Extensions to Support IP Version 6. RFC 3596, Cisco, Microsoft, 6WIND, AFNIC, October 2003.
[62] ITU. Information technology - Abstract Syntax Notation One (ASN.1): Specification of basic notation.
Recommendation ITU-T X.680, International Telecommunication Union, December 1997.
[63] ITU. Information technology - ASN.1 encoding rules: Specification of Basic Encoding Rules (BER), Canonical Encoding Rules (CER) and Distinguished Encoding Rules (DER). Recommendation ITU-T X.690, International Telecommunication Union, December 1997.
[64] D. Steedman. Abstract Syntax Notation One (ASN.1): The Tutorial and Reference. Technology Appraisals, 1990.
[65] M. Rose and K. McCloghrie. Structure and Identification of Management Information for TCP/IP-based Internets. RFC 1155, Performance Systems International, Hughes LAN Systems, May 1990.
[66] J. Case, M. Fedor, M. Schoffstall, and J. Davin. A Simple Network Management Protocol. RFC 1157, SNMP Research, PSI, MIT, May 1990.
[67] S. Legg. Generic String Encoding Rules (GSER) for ASN.1 Types. RFC 3641, Adacel Technologies, October 2003.
[68] D. Crocker and P. Overell. Augmented BNF for Syntax Specifications: ABNF. RFC 2234, Internet Mail Consortium, Demon Internet Ltd., November 1997.
[69] J. Postel. Simple Mail Transfer Protocol. RFC 821, ISI, August 1982.
[70] J. Klensin. Simple Mail Transfer Protocol. RFC 2821, AT&T Laboratories, April 2001.
[71] J. Postel and J. Reynolds. File Transfer Protocol (FTP). RFC 959, ISI, October 1985.
[72] M. Allman and S. Ostermann. FTP Security Considerations. RFC 2577, NASA Glenn/Sterling Software, Ohio University, May 1999.
[73] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. Hypertext Transfer Protocol – HTTP/1.1. RFC 2616, UC Irvine, Compaq/W3C, Compaq, W3C/MIT, Xerox, Microsoft, W3C/MIT, June 1999.
[74] T. Berners-Lee, R. Fielding, and L. Masinter. Uniform Resource Identifiers (URI): Generic Syntax. RFC 2396, MIT/LCS, U.C. Irvine, Xerox Corporation, August 1998.
[75] I. Cooper and J. Dilley. Known HTTP Proxy/Caching Problems. RFC 3143, Equinix, Akamai Technologies, June 2001.
[76] J. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann, Y. Goland, A.
van Hoff, and D. Hellerstein. Delta encoding in HTTP. RFC 3229, Compaq WRL, AT&T, Univ. of Saarbruecken, Marimba, ERS/USDA, January 2002.
[77] K. Moore. On the use of HTTP as a Substrate. RFC 3205, University of Tennessee, February 2002.
[78] S. McCanne and V. Jacobson. The BSD Packet Filter: A New Architecture for User-level Packet Capture. In Proc. Usenix Winter Conference, January 1993.