Lecture Notes - Computer Networks and Distributed Systems

Networks and Protocols
(320301)
Lecture Notes Fall 2005
Jürgen Schönwälder
August 27, 2005
School of Engineering and Science
International University Bremen
Preface
The lecture “Networks and Protocols” is an introduction to the foundations of packet-switched
data communication networks. The lecture covers widely deployed Internet technologies and the
IEEE 802 standards for local area networks.
The material covered in this lecture deliberately focuses on widely deployed technologies and
protocols. This approach has the disadvantage of ignoring some interesting alternative technologies
which did not become widely deployed for whatever (often non-technical) reason.
However, the advantage of this approach is that more time can be spent discussing the selected
technologies in some detail and enabling and encouraging students to experiment with their own
network infrastructure. This usually increases both the understanding of the material and the motivation.
Some parts of the lecture notes date back to a lecture called “Introduction to Operating Systems
and Networks” which I gave at the Technical University Braunschweig. These notes were
later heavily revised and extended for a lecture called “Computer Networks” which I gave at
the University of Osnabrück. Some parts of these lecture notes are heavily influenced by standard
textbooks such as [1, 2, 3, 4, 5, 6, 7] while other parts are directly derived from the relevant
standards. Students who want to understand the discussed protocols in even more detail are strongly
encouraged to read the relevant parts of the standards which are referenced throughout the text.
My thanks go to the many students who asked critical questions and provided constructive feedback
which improved the presentation and reduced the number of errors and inconsistencies.
Jürgen Schönwälder
Contents
1 Introduction
  1.1 Fundamental Concepts
    1.1.1 Services
    1.1.2 Protocols
    1.1.3 Names and Addresses
    1.1.4 ISO/OSI Reference Model
    1.1.5 Internet Reference Model
  1.2 Standardization
    1.2.1 ISO Standardization
    1.2.2 Internet Standardization
    1.2.3 IEEE Standardization
2 IEEE 802 Local Area Networks
  2.1 Logical Link Control (IEEE 802.2)
  2.2 Ethernet (IEEE 802.3)
    2.2.1 Physical Layer (PHY)
    2.2.2 Medium Access Layer (MAC)
    2.2.3 Fast-Ethernet (IEEE 802.3u)
    2.2.4 Gigabit Ethernet (IEEE 802.3z/802.3ab)
    2.2.5 10 Gigabit Ethernet (IEEE 802.3ae)
  2.3 Wireless LANs (IEEE 802.11)
  2.4 Bluetooth LANs (IEEE 802.15)
  2.5 Port Access Control (IEEE 802.1X)
  2.6 Bridges
    2.6.1 Source Routing Bridges
    2.6.2 Transparent Bridges (IEEE 802.1D)
  2.7 Virtual LANs (IEEE 802.1Q)
  2.8 LAN Priorities (IEEE 802.1D)
3 Internet Network Layer
  3.1 Fundamentals
    3.1.1 Evolution of the Internet
    3.1.2 Internet Design Principles
    3.1.3 Basic Terminology
    3.1.4 Autonomous Systems
    3.1.5 Internet Address Scopes
  3.2 Internet Protocol Version 4 (IPv4)
    3.2.1 IPv4 Addresses
    3.2.2 IPv4 Packet Format
    3.2.3 IPv4 Forwarding
    3.2.4 IPv4 Error Handling (ICMPv4)
    3.2.5 MTU Path Discovery
    3.2.6 IPv4 over IEEE 802.3
    3.2.7 IPv4 Address Translation (ARP, RARP)
    3.2.8 Automatic Configuration (DHCP)
  3.3 Internet Protocol Version 6 (IPv6)
    3.3.1 IPv6 Addresses
    3.3.2 IPv6 Packet Format
    3.3.3 IPv6 Extensions
    3.3.4 IPv6 Forwarding
    3.3.5 IPv6 Error Handling (ICMPv6)
    3.3.6 IPv6 over IEEE 802.3
    3.3.7 IPv6 Neighbor Discovery
  3.4 Routing Protocols
    3.4.1 Routing Information Protocol (RIP)
    3.4.2 Open Shortest Path First (OSPF)
    3.4.3 Border Gateway Protocol (BGP)
4 Internet Transport Layer
  4.1 Pseudo Header
  4.2 User Datagram Protocol (UDP)
  4.3 Transmission Control Protocol (TCP)
    4.3.1 Connection Establishment
    4.3.2 Connection Tear-down
    4.3.3 State Machine
    4.3.4 Flow Control
    4.3.5 Congestion Control
    4.3.6 Retransmission Timer
  4.4 Stream Control Transmission Protocol (SCTP)
  4.5 Datagram Congestion Control Protocol (DCCP)
5 Internet Application Layer
  5.1 Domain Name System (DNS)
    5.1.1 Format of Domain Names
    5.1.2 Resource Records
    5.1.3 DNS Message Formats
  5.2 Abstract Syntax Notation One (ASN.1)
    5.2.1 Basic Concepts
    5.2.2 ISO Registration Tree
    5.2.3 Primitive ASN.1 Data Types
    5.2.4 Constructed ASN.1 Data Types
    5.2.5 Restrictions on ASN.1 Data Types
    5.2.6 ASN.1 Tags
    5.2.7 Example ASN.1 Definition
    5.2.8 Basic Encoding Rules (BER)
    5.2.9 Generic String Encoding Rules (GSER)
  5.3 Simple Network Management Protocol (SNMP)
    5.3.1 Foundations
  5.4 Augmented Backus-Naur Form (ABNF)
    5.4.1 Rule Names, Comments and Terminal Symbols
    5.4.2 Operators
    5.4.3 Core Definitions
    5.4.4 ABNF in ABNF
  5.5 Simple Mail Transfer Protocol (SMTP)
    5.5.1 Foundations
    5.5.2 Commands and Responses
    5.5.3 Message Headers
    5.5.4 Multipurpose Internet Mail Extensions (MIME)
  5.6 Internet Message Access Protocol (IMAP)
    5.6.1 Identification and States
    5.6.2 States
    5.6.3 Commands
    5.6.4 Tagging
    5.6.5 Message Format
  5.7 File Transfer Protocol (FTP)
  5.8 Hypertext Transfer Protocol (HTTP)
    5.8.1 Persistent Connections and Pipelining
    5.8.2 Caching and Proxies
    5.8.3 Negotiation
    5.8.4 Conditional Requests
    5.8.5 Delta Encoding
    5.8.6 HTTP as a Substrate
A Packet Capturing
  A.1 BSD Packet Filter (BPF)
  A.2 libpcap
  A.3 jpcap
B Sockets
  B.1 Socket Addresses
  B.2 Communication Kinds
  B.3 Socket API Overview
  B.4 Name Resolution
  B.5 Multiplexing
Chapter 1
Introduction
1.1 Fundamental Concepts
This section discusses some architectural concepts and introduces some basic terms. A fundamental
approach to dealing with complex systems is to divide them into subsystems with well-defined
interfaces between those subsystems. Accordingly, networks are usually designed as layered systems
where each layer is responsible for providing a certain function. Layering is a
fundamental principle for structuring communication systems.
1.1.1 Services
The abstract set of functions provided by a given network is called the service realized by the
network. The service provided by a network is usually defined in abstract terms to allow for
multiple concrete programming interfaces.
• A service is used by using one or more service primitives. Typical ISO/OSI service primitives
are:
– Request of a service (request)
– Indication of the request of a service (indication)
– Response to the request of a service (response)
– Confirmation of the requested service (confirmation)
• The interface, which is used to access the service primitives, is called a service access point.
• Services are realized by so-called (protocol) instances. The instances at layer N of a layered
system are accordingly called N-instances.
• Strict layering requires that N-instances may only use services realized by (N − 1)-instances
to realize the layer N service.
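The order in which the four service primitives occur for a confirmed service can be illustrated with a minimal sketch. All names used below (Provider, confirmed_service, echo_user) are hypothetical and chosen for illustration only; the sketch merely records the sequence of primitives and does not model a real service access point.

```python
# Hypothetical sketch: the four OSI service primitives of a confirmed service.
# All names are illustrative assumptions, not part of any standard API.

class Provider:
    """Stands in for the service provider between two service users."""

    def __init__(self):
        self.trace = []  # order in which the primitives occurred

    def confirmed_service(self, responder, data):
        self.trace.append("request")       # initiator asks for the service
        self.trace.append("indication")    # responder is told about the request
        answer = responder(data)           # responding service user handles it
        self.trace.append("response")      # responder answers the request
        self.trace.append("confirmation")  # initiator learns about the outcome
        return answer


def echo_user(data):
    # A trivial responding service user which returns the data in upper case.
    return data.upper()


provider = Provider()
result = provider.confirmed_service(echo_user, "hello")
print(provider.trace, result)
```

An unconfirmed service would use only the first two primitives (request and indication).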
1.1.2 Protocols
A protocol is a set of functions which together realize a well-defined communication service (e.g.,
error-free ordered transmission of data from a sender to a receiver).
• Protocols define the format and the semantics of the protocol data units (PDUs) exchanged
between communicating parties.
• Protocols specifically define the rules which have to be followed when creating or processing
protocol data units.
• The instantiation of a protocol at runtime is realized by a so called protocol instance. In the
general case, it is possible to have multiple instances of a protocol running concurrently on a
single system.
• Specialized protocols have been developed for various application domains. Note that a certain service can be realized by multiple different protocols.
• The specification of protocols can be either informal (plain English text) or formal. Formal
protocol specifications usually use specification languages such as Lotos, Estelle or SDL
which have been developed for this purpose.
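To make the notion of a PDU format and its processing rules more concrete, the following sketch encodes and decodes a toy PDU. The format (one type byte, a two byte sequence number, a two byte payload length, all in network byte order) is invented for this illustration and does not correspond to any real protocol.

```python
import struct

# Hypothetical toy PDU header: type (1 byte), sequence number (2 bytes),
# payload length (2 bytes), all in network byte order.
HEADER = struct.Struct("!BHH")


def encode_pdu(pdu_type, seq, payload):
    """Create a PDU by prepending the header to the payload."""
    return HEADER.pack(pdu_type, seq, len(payload)) + payload


def decode_pdu(data):
    """Process a received PDU; reject PDUs that violate the format rules."""
    if len(data) < HEADER.size:
        raise ValueError("truncated header")
    pdu_type, seq, length = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return pdu_type, seq, payload


pdu = encode_pdu(1, 7, b"hi")
print(pdu, decode_pdu(pdu))
```

The length check in decode_pdu is an example of a processing rule: a receiver following the protocol discards PDUs that do not match the defined format.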
1.1.3 Names and Addresses
Protocol instances are usually identified by some sort of an address. Addresses for protocol instances that only exist once on a certain system are often also used to identify whole systems. In
addition, more human friendly names are often used and mapped to addresses as needed.
• A human-friendly name is an identification of a system or protocol instance which is relatively
easy for humans to read and memorize. Well-known examples are Internet domain names
such as www.iu-bremen.de.
• Names often have variable length and the name space is usually structured hierarchically.
• Addresses, on the other hand, are identifications of protocol instances which are optimized
for machine processing. A typical example is an Internet Protocol (IP) address such as
212.201.48.1.
• Addresses usually have a fixed length and are relatively compact since they are frequently
transmitted.
ISDN Addresses
The Integrated Services Digital Network (ISDN) is the digital telecommunication network which is
widely available in Europe. ISDN addresses which identify telecommunication equipment such as
phones are structured according to the E.164 numbering plan defined by the ITU:
• An international ISDN phone number consists of a maximum of 15 digits. The first digits
contain the country code, followed by a national region code followed by the phone number
within that region.
• An international ISDN phone number can be followed by an up to 40 digit long target identifier.
• The common notation for international ISDN phone numbers starts with a + symbol followed
by digits which can be grouped into blocks by using white space or other separator characters.
An example would be +49 241 200 3587.
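The notation rules above can be sketched as a small normalization function. The function name normalize_e164 and the set of accepted separator characters are assumptions made for this illustration; only the leading + symbol and the limit of 15 digits are taken from the text.

```python
import re


def normalize_e164(number):
    """Return the number as '+' followed by digits, or raise ValueError."""
    if not number.startswith("+"):
        raise ValueError("international notation starts with '+'")
    # Drop grouping characters (assumed set: white space, '-', '.', '/').
    digits = re.sub(r"[\s\-./]", "", number[1:])
    if not re.fullmatch(r"[0-9]+", digits):
        raise ValueError("unexpected character in number")
    if len(digits) > 15:
        raise ValueError("more than 15 digits")
    return "+" + digits


print(normalize_e164("+49 241 200 3587"))
```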
Internet Addresses
Internet network layer addresses have a fixed size. Depending on the protocol version (IPv4 or
IPv6), these addresses are either 4 bytes or 16 bytes long.
• Four byte IPv4 addresses are typically written as four decimal numbers separated by dots
where every decimal number represents one byte (dotted quad notation). A typical example
is the IPv4 address 212.201.48.1.
• Sixteen byte IPv6 addresses are typically written as a sequence of hexadecimal numbers
separated by colons (:) where every hexadecimal number represents two bytes. Leading
zeros can be omitted and two consecutive colons can represent a sequence of zero groups
(this shortcut may appear only once in an address). For
example, the IPv6 address 1080:0:0:0:8:800:200C:417A can be written somewhat
shorter as 1080::8:800:200C:417A. IPv6 addresses which contain IPv4 addresses can
be written by using the dotted quad notation for the IPv4 address portion. For example, the
IPv6 address 0:0:0:0:0:0:0D01:4403 can be written as ::0D01:4403 as well as
::13.1.68.3.
Further details about IPv6 addresses can be found in RFC 3513 [8]. A more compact representation
of IPv6 addresses can be found in RFC 1924 [9] (recommended reading). In this context, also RFC
1925 [?] is highly recommended background reading material.
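The address notations described above can be explored with the ipaddress module of the Python standard library. This is only a small illustration using the example addresses from the text; note that the module prints hexadecimal digits in lower case.

```python
import ipaddress

# A four byte IPv4 address in dotted quad notation.
v4 = ipaddress.IPv4Address("212.201.48.1")
print(len(v4.packed), list(v4.packed))  # one byte per decimal number

# A sixteen byte IPv6 address in its long and short notation.
v6 = ipaddress.IPv6Address("1080:0:0:0:8:800:200C:417A")
print(len(v6.packed))   # number of address bytes
print(v6.compressed)    # shortened form using "::"
print(v6.exploded)      # all eight groups with leading zeros

# The hexadecimal and the dotted quad spelling denote the same address.
same = ipaddress.IPv6Address("::0D01:4403") == ipaddress.IPv6Address("::13.1.68.3")
print(same)
```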
IEEE 802 MAC Addresses
IEEE 802 addresses, sometimes also called MAC addresses, are usually 6 bytes (48 bits) long.
(There are also 2 byte (16 bit) IEEE 802 addresses, which however do not play a significant role.)
• The common notation for IEEE 802 addresses is a sequence of hexadecimal numbers (one
number for each address byte) where the numbers are separated from each other using colons
or hyphens. Typical examples are 00:D0:59:5C:03:8A or 00-D0-59-5C-03-8A.
• The individual/group bit of an IEEE 802 address (in the common notation, the least significant
bit of the first byte; on Ethernet this is the first bit transmitted) indicates whether it is a
normal unicast address (0) or a multicast address (1). The broadcast address, which represents
all stations within a broadcast domain, consists of 48 one bits.
• The next bit (the second least significant bit of the first byte) defines whether it is a local
address (1) or a global address (0). A local address is assigned administratively and is only
unique within this administrative region, while global addresses are globally unique.
• Globally unique IEEE 802 addresses are created by vendors who have to apply to the IEEE
for a number space. The vendor then assigns unique numbers taken from the address space
delegated to it. It is thus possible to identify the vendor of a network device by looking up
the vendor code (the first three bytes) in a number space delegation list.
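The structure of an IEEE 802 address can be inspected with a few lines of code. The sketch below assumes the usual convention that, in the written notation, the unicast/multicast bit is the least significant bit of the first byte and the global/local bit is the second least significant bit of the first byte. The function names parse_mac and describe are hypothetical.

```python
def parse_mac(text):
    """Parse 'aa:bb:cc:dd:ee:ff' or 'aa-bb-cc-dd-ee-ff' into 6 bytes."""
    parts = text.replace("-", ":").split(":")
    if len(parts) != 6:
        raise ValueError("expected six address bytes")
    return bytes(int(part, 16) for part in parts)


def describe(mac):
    """Extract the flag bits and the vendor code of a 6 byte address."""
    first = mac[0]
    return {
        "multicast": bool(first & 0x01),  # 0 = unicast, 1 = multicast
        "local": bool(first & 0x02),      # 0 = global, 1 = local
        "broadcast": mac == b"\xff" * 6,  # 48 one bits
        "vendor_code": ":".join(f"{b:02X}" for b in mac[:3]),
    }


info = describe(parse_mac("00-D0-59-5C-03-8A"))
print(info)
```

Looking up the vendor code in the IEEE delegation list is left out here, since it requires a copy of that list.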
Internet Domain Names
Internet addresses are optimized for machine processing and storage and not necessarily for human
memory. This led to the introduction of names which are more oriented towards the requirements
of human beings.
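Figure 1.1: Structure of Domain Name System (DNS) Names (a tree which starts at a virtual root and
descends through the top level, 2nd level, 3rd level, and 4th level; the top level contains labels
such as nl, de, edu, org, net, com, and biz, and lower levels contain labels such as iu-bremen,
eecs, and www)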
• The Domain Name System (DNS) defines a distributed hierarchical name space which in
particular supports the delegation of name assignments.
• In many cases, the structure of the DNS name space reflects the organizational structure of
the organization which maintains the relevant part of the DNS name space.
• When using DNS names to refer to a node on the Internet, a process called name resolution
is performed which translates the DNS name to one or more IP addresses.
• The traditional and widely deployed DNS does not support internationalized domain names.
A special encoding has therefore been defined recently to support internationalized domain
names without any changes to the DNS infrastructure.
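The hierarchical structure of DNS names and the encoding of internationalized names can be illustrated with a short sketch. It assumes that the special encoding mentioned above refers to IDNA, for which Python ships a built-in "idna" codec; the name bücher.example is a made-up example.

```python
# Split a DNS name into its labels, from the top level downwards.
name = "www.iu-bremen.de"
labels = list(reversed(name.split(".")))
print(labels)

# Encode an internationalized name so that it only uses characters
# which the traditional DNS infrastructure already supports.
encoded = "bücher.example".encode("idna")
decoded = encoded.decode("idna")
print(encoded, decoded)
```

Since the encoded form consists of plain ASCII labels, it can be stored and resolved like any other DNS name.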
1.1.4 ISO/OSI Reference Model
The ISO/OSI Reference Model is the classic layered model for communication networks which was
developed during the ISO work on the Open Systems Interconnection (OSI). Real networks usually
do not follow strictly the seven layer OSI model.
Physical Layer
• Transmission of a sequence of bits over a transmission medium.
• Definition of the properties of the physical media.
• Representation of the binary values 0 and 1 (e.g., voltages, frequencies).
• Synchronization between sender and receiver.
• Definition of standards for connectors and sockets.
Figure 1.2: ISO/OSI reference model (two end systems, each running an application process on top of
the layers Application, Presentation, Session, Transport, Network, Data Link, and Physical, are
connected via media and a transit system which implements only the Network, Data Link, and Physical
layers; the layers are grouped into a transport system and an application system)
Data Link Layer
• Transmission of larger bit sequences in so called frames.
• Data transfer between systems connected to the same medium.
• Detection and correction of transmission errors.
• Flow control to adapt the speed between senders and receivers.
• Realization usually in hardware.
Network Layer
• Determination of paths through a complex communication network.
• Multiplexing of end system connections over intermediate systems.
• Error detection and correction between sending and receiving network nodes.
• Flow and congestion control between end systems.
• Transmission of datagrams or packets in packet switched networks.
Transport Layer
• End-to-end communication channels between applications.
• Virtual connections over connection-less datagram services in packet switched networks.
• Error detection and correction between transport layer endpoints.
• Flow and congestion control between transport layer endpoints.
Session Layer
• Synchronization and coordination of communicating processes.
• Interaction control (check points).
Presentation Layer
• Harmonization of different data representations.
• Serialization of complex data structures.
• Data compression.
Application Layer
• Realization of fundamental application oriented services.
• Examples: terminal emulation, management of name spaces, database access, network
management, electronic messaging systems, process and machine control, . . .
1.1.5 Internet Reference Model
Figure 1.3: Internet reference model (two end systems, each running an application process on top of
the layers Application, Transport, Internet, and Subnetwork, are connected via media and a transit
system which implements only the Internet and Subnetwork layers)
• The Internet has been designed as a network which can be implemented on top of almost any
other communication network by making very few assumptions about the services provided
by the underlying communication networks. Accordingly, the layer below the Internet layer
(which basically corresponds to the network layer of the OSI reference model) is called a
subnetwork (see RFC 1149 [10] for an interesting example of a subnetwork).
• The Internet Protocol (IP) provides a common basis which makes it possible to cross boundaries imposed by various other network technologies.
• The Internet Protocol can of course also be used as a subnetwork technology, which naturally
leads to so called IP tunnels.
• There are currently two protocols on the Internet network layer. The currently widely deployed IP protocol is version 4 (IPv4). The IP protocol version 6 (IPv6) is slowly gaining
deployment and practical importance.
• Internet protocols are often designed to simplify implementations (usually in software, even
though high-speed devices implement many protocols in hardware).
• The Internet protocols are primarily designed for data communication (asynchronous, best-effort) and only recent work tries to support voice and multimedia communication (isochronous traffic and quality of service).
• Implementations of many Internet protocols are freely available which helps to transfer the
protocols from research/development into actual products. Universities and research labs
traditionally play a big role as a melting pot and experimentation field for new protocols.
1.2 Standardization
The standardization of protocols creates unified network architectures supporting open (that is, vendor-independent) communication. Vendor-specific protocols and architectures (e.g., SNA or DECnet) have lost importance.
Figure 1.4: Theory of Standardization (plot of activity over time with the labels Research,
Investment, and Time for Standardization)
Standardization itself is a complicated and in most cases a time consuming and thus expensive
process. However, once an open standard has been established, it can create an open competitive
market which leads to the development of high-quality products which are usually available at very
reasonable prices. However, only a very small fraction of the developed standards are actually
successful in terms of wide-spread deployment:
• The success of a standard must be measured in the number of actually deployed interoperable
implementations.
• Standards must allow vendors to differentiate their products.
• Successful standards create an open market for new products.
• One critical factor for the success of a standards activity is the timing.
There are many organizations which develop standards for communication networks. The most
important organizations and their standards processes are briefly introduced in the following subsections.
1.2.1 ISO Standardization
• The International Organization for Standardization (ISO) is an organization for establishing
international standards. ISO standards cover a wide spectrum of things, such as paper sizes
or screws. Note that the abbreviation ISO stems from the Greek word isos, meaning “equal”.
• ISO is a network of the national standards institutes of almost 150 countries, on the basis of
one member per country (ANSI for the USA, DIN for Germany), with a Central Secretariat
in Geneva, Switzerland, that coordinates the system.
• The ISO standardization process distinguishes three states:
1. Draft Proposal (DP)
2. Draft International Standard (DIS)
3. International Standard (IS)
The transition between these states requires majorities during voting processes and transitions
can be repeated multiple times.
• Standards are identified by numbers. Different revisions of the same standard are published
under the same number. To distinguish the revisions, the year of the publication is usually
appended to the number of a standard.
• The ISO work on Open Systems Interconnection (OSI) covers the standards which deal with communication in open (communication) systems.
1.2.2 Internet Standardization
• The Internet Engineering Task Force (IETF) is responsible for the standardization of the
Internet protocols (RFC 3233 [11], RFC 2026 [12]).
• Internet standards are usually developed by working groups (WGs) which are organized in
different areas (e.g., routing or transport).
Figure 1.5: Internet standardization process model (a Working Group Document (Internet Draft)
becomes a Proposed Standard (RFC), then a Draft Standard (RFC), and finally an Internet Standard
(RFC); a document in any of the three RFC states can become Historic)
• Every area is usually led by two area directors (ADs). All the area directors together form
the Internet Engineering Steering Group (IESG), which has to approve all documents on the
standardization track.
• The IETF standardization process distinguishes three states:
1. Proposed Standard
2. Draft Standard
3. Internet Standard
The transitions between these states usually require “rough consensus and running code.”
Multiple interoperable independent implementations are required to move from Proposed
Standard to Draft Standard, and real-world deployment is required to move from Draft Standard to Internet Standard.
• All standards are published as so-called Requests for Comments (RFCs). Every RFC has
a unique number and RFCs are never changed after publication. Different revisions of a
standard thus have different RFC numbers. There are special documents which help to locate
the current RFC number for a given standard. Note: Not all RFCs are standards! There
are also informational and experimental RFCs as well as RFCs which document best current
practices.
• The Internet Architecture Board (IAB) is a panel which looks at longer-term architectural
issues and sometimes gives advice to the IETF.
• The Internet Research Task Force (IRTF) is an organization that exists in parallel to the IETF
and which looks at research questions, potentially preparing future standardization work.
Similar to the working groups of the IETF, the IRTF is structured into research groups. The
chairs of the research groups together form the Internet Research Steering Group (IRSG).
1.2.3 IEEE Standardization
Standardization within the IEEE is organized and controlled by the IEEE-SA Standards Board. The
documents created by standardization activities fall into the following categories:
• Standards are documents which define IEEE Standards.
• Recommended Practices can define procedures.
• Guides discuss alternate approaches and can provide additional background information.
• Trial-Use Documents exist only for a limited period of time.
An IEEE standardization project can produce different classes of documents:
• A new document (New) defines a standard which is not a revision of an already existing
standard.
• An already existing standard can be updated and replaced by a document which is called a
Revision.
• A Corrigendum is a document which makes substantial corrections in another standards document.
• An existing standard can be extended by another document which can also make substantial
corrections. Such a document is called an Amendment.
The IEEE entity which is responsible for the creation and process management of a standardization
project is called the sponsor. A project starts by submitting a Project Authorization Request (PAR). The
IEEE-SA Standards Board is the board which decides whether a PAR is accepted. PARs are evaluated by the New Standards Committee (NesCom).
Technical work takes place in so-called working groups and is finalized by a voting procedure
(ballot). It is generally desired to avoid negative votes by achieving consensus before the final ballot.
After a successful ballot, the draft of the new standard is submitted to the IEEE-SA Standards Board
for approval. The IEEE-SA Standards Board itself makes use of a Review Committee (RevCom)
which helps to review the documents and to form an opinion.
Chapter 2
IEEE 802 Local Area Networks
The 802.x series of IEEE standards has been under development since the middle of the 1980s. These
standards dominate the technology used in local area networks (LANs) and there is currently a trend to
use the IEEE 802.x specifications also in metropolitan area networks (MANs). Some of the IEEE
standards have also been approved as official ISO standards.
Figure 2.1: Overview of the IEEE 802 standards (802 Overview and Architecture, 802.1 Management,
and 802.1 Bridging span all technologies; 802.2 Logical Link Control sits on top of the Medium
Access and Physical layers of 802.3 Ethernet, 802.4 Token Bus, 802.5 Token Ring, 802.6 DQDB,
802.9, 802.11 WaveLan, and 802.12)
The currently most widely known standards are Ethernet (IEEE 802.3) and Wireless LANs (IEEE
802.11). An IEEE standard for Bluetooth was approved in March 2002.
The IEEE 802.x standards cover the two lower layers of the OSI reference model. However, the
IEEE 802.x standards subdivide the OSI data link layer into two sub-layers:
• The Logical Link Control (LLC) layer provides a service interface which is the same for all
IEEE 802 protocols. Protocols on the network layer (e.g., the Internet Protocol) use the services provided by the LLC layer and thus work (in principle) over all IEEE 802.x protocols.
(In reality, there are sometimes differences with regard to the LLC layer service primitives
supported by a given IEEE 802.x technology that can affect the mapping of network layer
protocols.)
• The Medium Access Control (MAC) layer defines the method used to access the medium being
used.
• The Physical (PHY) layer defines the physical properties for the various transmission media
that can be used with a certain IEEE 802.x protocol.
[Figure: OSI layers of an end-system (Application, Representation, Session, Transport, Network,
Data Link, Physical), grouped into application system and transport system. IEEE 802 covers the
Data Link layer, split into Logical Link Control (LLC) and Media Access Control (MAC), and the
Physical (PHY) layer.]
Figure 2.2: IEEE 802 layers in the OSI reference model
The split of the data link layer into two sub-layers has been a very important decision which enabled
the IEEE to standardize very different media access technologies and protocols with a common data
link interface.
2.1 Logical Link Control (IEEE 802.2)
The Logical Link Control layer is modeled after the ISO service model and provides services that
are close to those offered by the HDLC protocol discussed in the second year lecture “Operating
Systems and Networks”. Note that not all services are realized by all existing IEEE 802.x protocols.
2.2 Ethernet (IEEE 802.3)
The IEEE 802.3 standard is probably better known as Ethernet (see footnote 1). The Ethernet technology
was developed in the 1970s at XEROX PARC [13] and was later standardized with few changes by the
IEEE [14]. The classic IEEE 802.3 network is a 1-persistent CSMA/CD network with a data rate
of 1-10 Mbps.
Footnote 1: The term Ethernet is usually used synonymously for the IEEE 802.3 standards and the
CSMA/CD technology in general, although this is not really correct.
year     milestone
1976     Original Ethernet paper [13] published
1990     10 Mbps Ethernet over twisted pair (10BaseT)
1995     100 Mbps Ethernet
1998     1 Gbps Ethernet
2002     10 Gbps Ethernet
2006*    100 Gbps Ethernet (predicted)
2008*    1 Tbps Ethernet (predicted)
2010*    10 Tbps Ethernet (predicted)
Table 2.1: Evolution of the Ethernet technology
Since the IEEE 802.3 technology was very successful, the IEEE started efforts to define extensions
for 1 Gbps, 10 Gbps networks and so on. In June 2002, an IEEE standard for 10 Gbps Ethernets
was approved while the standard for 100 Gbps Ethernet is under development. The evolution of the
Ethernet standards is summarized in Table 2.1.
2.2.1 Physical Layer (PHY)
The physical layer of the IEEE 802.3 standard defines the transmission related properties. The
following media and topologies are defined:
name       medium             max. length   max. stations   topology
10Base2    coax, ø=0.25 in    200 m         30              bus
10Base5    coax, ø=0.5 in     500 m         100             bus
10BaseT    twisted pair       100 m         1024            star
10BaseF    fiber optic        2000 m        1024            star
Table 2.2: IEEE 802.3 physical layer media and topologies
The different media have different signal propagation delays. The speed of light c is approximately
c ≈ 300,000 km/s. The signal propagation speed of the various media can be expressed relative to
the speed of light as shown in Table 2.3.
medium          signal propagation speed
thick coax      0.77c ≈ 231,000 km/s
thin coax       0.65c ≈ 195,000 km/s
twisted pair    0.59c ≈ 177,000 km/s
fiber optic     0.66c ≈ 198,000 km/s
Table 2.3: Signal propagation speeds for various IEEE 802.3 physical layer media
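The relative speeds in Table 2.3 translate directly into propagation delays. The following Python
sketch computes the one-way delay over a segment of maximum length, using the lengths of Table 2.2.
The pairing of 10Base5 with thick coax and of 10Base2 with thin coax, as well as all names in the
code, are assumptions made for this illustration.

```python
C_KM_PER_S = 300_000  # approximate speed of light, as used in the text

# (relative speed from Table 2.3, max. segment length in m from Table 2.2)
MEDIA = {
    "10Base5": (0.77, 500),   # thick coax (assumed pairing)
    "10Base2": (0.65, 200),   # thin coax (assumed pairing)
    "10BaseT": (0.59, 100),   # twisted pair
    "10BaseF": (0.66, 2000),  # fiber optic
}

def propagation_delay_us(name):
    """One-way propagation delay over a maximum-length segment in microseconds."""
    factor, length_m = MEDIA[name]
    speed_m_per_s = factor * C_KM_PER_S * 1000
    return length_m / speed_m_per_s * 1e6

for name in MEDIA:
    print(f"{name}: {propagation_delay_us(name):.2f} us")
```

Even on the longest single segment the one-way delay stays in the range of a few microseconds,
which is short compared to the transmission time of a frame at 10 Mbps.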
The 10Base5 medium, a rather thick copper coax wire, was also known as “yellow cable.” Stations
were attached to a yellow cable by drilling a hole into the coax cable and sticking a needle into the
heart of the cable. The 10Base2 medium, also sometimes called “cheapernet”, was easier to deploy
since it was more flexible and stations were attached by means of so-called T-connectors. The downside
of this technology was that segments were more tightly limited in length and in the number of
stations that could be supported. The fiber optic medium on the other hand supported a much larger
distance, but was rather expensive to deploy.
2.2.2 Medium Access Layer (MAC)
The IEEE 802.3/Ethernet frame format is rather simple and shown in Figure 2.3.
field                               size
preamble                            7 Byte
start-of-frame delimiter (SFD)      1 Byte
destination MAC address             6 Byte
source MAC address                  6 Byte
length / type field                 2 Byte
data (network layer packet)         46-1500 Byte (together with the padding)
padding (if required)
frame check sequence (FCS)          4 Byte
The fields from the destination MAC address to the FCS add up to 64-1518 Byte.
Figure 2.3: IEEE 802.3 frame format
The various fields in the frame serve the following purposes:
• The seven byte preamble consists of the bit pattern 10101010 (binary). This pattern together with
the Manchester Coding technique results in a periodic signal which allows the receiver to
synchronize to the speed of the sender.
• The start-of-frame delimiter (SFD) has the bit pattern 10101011 (binary). The resulting signal
change at the end of the start-of-frame delimiter indicates the start of a frame.
• The source and destination address fields contain six byte IEEE MAC addresses.
• The two byte type/length field contains either the length of the data carried in the frame (value
less than 0x600) or the identification of the higher level protocol used by the data carried in the
frame (value greater than or equal to 0x600). Type numbers are maintained by the IEEE and are
globally unique.
• The data portion contains the actual payload, usually a packet of a network layer protocol. If
necessary, the frame will be filled with padding bytes to achieve a minimal frame length.
• The end of the frame contains a four byte CRC frame checksum (CRC-32).
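The interpretation of the header fields described above can be sketched in a few lines of Python.
The sketch assumes a frame without preamble and SFD, pads with zero bytes, and uses the type value
0x0800 (commonly used for IPv4) as an example. The function names are made up for this illustration.

```python
import struct

def parse_header(frame: bytes):
    """Split a frame (without preamble and SFD) into its header fields."""
    dst, src, value = struct.unpack("!6s6sH", frame[:14])
    # Values below 0x600 are a length, values from 0x600 on a protocol type.
    kind = "length" if value < 0x600 else "type"
    return dst.hex(":"), src.hex(":"), kind, value

def pad_payload(payload: bytes, minimum: int = 46) -> bytes:
    """Append zero bytes (an assumption) until the data field has its minimal length."""
    return payload + bytes(max(0, minimum - len(payload)))

# Broadcast destination, example source address, example type value 0x0800.
frame = bytes.fromhex("ffffffffffff" "020000000001" "0800") + pad_payload(b"hello")
print(parse_header(frame))
print(len(frame) + 4)  # frame length once the four byte FCS is appended
```

With a five byte payload the padding brings the frame to the minimal length of 64 bytes once the
FCS is included.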
IEEE 802.3 uses the CSMA/CD medium access method. Figures 2.4 and 2.5 show the principal
logic that is used to send and receive frames.
[Figure: flowchart. Wait for frame to transmit; format frame for transmission; while the carrier
sense signal is on, keep waiting; wait interframe gap time; start transmission. If no collision is
detected, complete transmission and set status transmission done. If a collision is detected,
transmit jam sequence and increment # attempts; if the attempt limit is reached, set status attempt
limit exceeded, otherwise compute and wait backoff time and try again.]
Figure 2.4: IEEE 802.3 MAC logic for sending frames
The following parameters play a role for a classic 10 Mbps IEEE 802.3 network:
• The slot time of 512 bit times equals twice the propagation delay plus some safety margin.
• Between two successive frames, a minimum inter-frame gap of 96 bit times is required to
ensure that frame ends are properly recognized.
• The minimal length of a frame is 64 byte; the maximum length is 1518 byte.
• If a collision has been detected, a special jam-signal is generated for the duration of 32 bit
times.
• The transmission of a frame will be (re)tried up to a maximum of 16 times in case of collisions. Once a collision has been detected by the sending station, the station waits a random
number R of slot times before retrying the transmission of the frame.
• On the n-th retransmission, a uniformly distributed integer R is chosen from the interval
[0, 2^k) with k = min(n, b) and the back-off limit b = 10.
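The choice of the random number R described above can be sketched as follows. This is a minimal
illustration of the truncated binary exponential back-off using the parameters listed above; the
function name and the use of Python's random module are assumptions of this sketch.

```python
import random

BACKOFF_LIMIT = 10   # b in the text
ATTEMPT_LIMIT = 16   # maximum number of transmission attempts

def backoff_slots(n, rng=random):
    """Number R of slot times to wait before the n-th retransmission."""
    k = min(n, BACKOFF_LIMIT)
    return rng.randrange(2 ** k)   # uniform over 0 .. 2^k - 1

for n in (1, 2, 3, 10, 15):
    upper = 2 ** min(n, BACKOFF_LIMIT) - 1
    print(f"retransmission {n}: R = {backoff_slots(n)} (range 0..{upper})")
```

The range doubles with every collision until the back-off limit is reached, so stations that
collide repeatedly spread their retries over an increasingly long period.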
[Figure: flowchart. Wait until an incoming signal is detected; set carrier sense signal on; obtain
bit sync and wait for SFD; receive frame. If FCS and frame size are OK and the destination address
matches the own or a group address, pass data to higher-layer protocol entity for processing;
otherwise discard frame.]
Figure 2.5: IEEE 802.3 MAC logic for receiving frames
There are a number of special situations which can be recognized by the MAC layer:
• Received frames with a non-integral number of bytes which fail the CRC test (alignment
errors).
• Received frames with a legal length that fail the CRC test (frame check sequence (FCS)
errors).
• Frames that could not be transmitted immediately since the medium was busy (deferred transmissions).
• Frames which were transmitted successfully after a single collision (single collision frames).
• Frames which were transmitted successfully after multiple collisions (multiple collision frames).
• Frames which could not be transmitted due to continued collisions (excessive collisions).
• Collisions detected after the slot time after the start of the transmission (late collisions).
Collisions that happen after the slot time are typically indications of wires that exceed the
maximum allowed length.
• MAC internal errors during the transmission of a frame that are not further specified
(internal MAC transmit errors).
• MAC internal errors during the receipt of a frame that are not further specified (internal
MAC receive errors).
• Failure to listen to the carrier signal during a transmission (carrier sense errors).
• Frames that exceed the maximum length of allowed frames (frame too long errors).
2.2.3 Fast-Ethernet (IEEE 802.3u)
The classic 10 Mbps IEEE 802.3 standard allows a maximum wire length (including repeaters which
basically amplify the signal) of 2.5 km. This results in a maximum round-trip propagation delay
(including some delay in repeaters) of about 50 µs. At 10 Mbps, a station sends 500 bits during
this time, which, with some safety margin, leads to a minimum frame length of 512 bit.
The objective of the development of the Fast-Ethernet standard was a data rate of 100 Mbps without
changes to the medium access mechanism. To achieve a higher bit-rate, the maximum length of the
wire has to be reduced. Accordingly, the Fast-Ethernet wire length is limited to 100m. This relatively short length was acceptable since the developers envisioned the transition to star topologies
with twisted pair cables.
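The relationship between data rate and network extent can be made concrete with a small
calculation. The sketch below assumes that the minimum frame length of 512 bit is kept unchanged
and computes the longest round-trip delay during which a sender is still transmitting such a frame.
The function name is chosen for this illustration only.

```python
MIN_FRAME_BITS = 512

def max_round_trip_us(bit_rate):
    """Longest round-trip delay that a minimum-length frame still covers."""
    return MIN_FRAME_BITS / bit_rate * 1e6

for rate in (10e6, 100e6):
    print(f"{rate / 1e6:.0f} Mbps: {max_round_trip_us(rate):.2f} us")
```

Going from 10 Mbps to 100 Mbps shrinks the tolerable round-trip delay by a factor of ten, which is
why the maximum wire length has to shrink accordingly if the medium access mechanism stays the same.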
Fast-Ethernet can be used with twisted pair and fiber optic cables. The support of UTP Category 3
and 5 cables results in some peculiarities in the physical layer. The general advice, however, is to
use Category 5 cables (or higher).
name         medium          max. length
100BaseT4    twisted pair    100 m
100BaseTX    twisted pair    100 m
100BaseFX    fiber optic     412 m
Table 2.4: IEEE 802.3u physical layer media and topologies
The 100BaseT4 medium uses four twisted pairs (Category 3 cables) while 100BaseTX uses two twisted
pairs (Category 5 cables).
2.2.4 Gigabit Ethernet (IEEE 802.3z/802.3ab)
The Gigabit-Ethernet standard specified in IEEE 802.3z initially supported fiber optic media. Support for category 5 UTP cables was later added by the IEEE 802.3ab specifications.
Gigabit Ethernet can operate in half-duplex and full-duplex mode. In half-duplex mode, the protocol
still uses the CSMA/CD method. To make the use of CSMA/CD possible, the slot time has been
changed from 64 bytes to 512 bytes which means that packets smaller than 512 bytes are augmented
with a new carrier extension field following the CRC field. When operating in full-duplex mode,
the original IEEE 802.3 slot-time is used and frames are not augmented.
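The effect of the carrier extension on short frames can be illustrated with a short Python sketch.
It assumes that the frame length is counted from the destination address to the FCS; the function
name is made up for this example.

```python
GIGABIT_SLOT_BYTES = 512   # half-duplex slot time expressed in bytes

def carrier_extension_bytes(frame_len, half_duplex=True):
    """Extension bytes following the CRC field of a frame of frame_len bytes."""
    if not half_duplex:
        return 0   # full-duplex mode: frames are not augmented
    return max(0, GIGABIT_SLOT_BYTES - frame_len)

for length in (64, 300, 512, 1518):
    print(length, carrier_extension_bytes(length))
```

A minimum-length frame of 64 bytes is thus followed by 448 extension bytes in half-duplex mode,
which shows how costly short frames become when CSMA/CD is retained at this data rate.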
New installations usually use Gigabit Ethernet in full duplex mode where frames can be sent and
received simultaneously and where almost all the theoretically available bandwidth can be used to
transmit data.
name          medium          max. length
1000BaseLX    fiber optic     500 / 550 / 5000 m
1000BaseSX    fiber optic     220-275 / 550 m
1000BaseCX    coax            25 m
1000BaseT     twisted pair    100 m
Table 2.5: IEEE 802.3z/802.3ab physical layer media and topologies
2.2.5 10 Gigabit Ethernet (IEEE 802.3ae)
The 10 Gigabit Ethernet specification IEEE 802.3ae is a full-duplex and fiber-only technology and
thus does not need the CSMA/CD medium access method anymore. There are two different physical layers specified: The LAN PHY layer is for local area networks while the WAN PHY layer has
an extended feature set compared to the LAN PHY layer.
2.3 Wireless LANs (IEEE 802.11)
The Wireless LAN (WaveLan) standard specified in IEEE 802.11 is rather different from the IEEE
802.3 standards. It uses a collision avoidance medium access method based on MACA, where small
RTS/CTS frames are exchanged before the data is actually transmitted.
Wireless LANs support two modes of operation. In the ad-hoc mode, stations are brought together
to form a network on the fly. An election algorithm is used to elect one station which serves as
the master while the other stations become slaves. The second mode assumes the presence of some
fixed network access points (also sometimes called base stations) with which mobile stations can
communicate.
2.4 Bluetooth LANs (IEEE 802.15)
The Bluetooth standard specified in IEEE 802.15 provides a wireless network technology for rather
small cells and is typically used to create wireless personal area networks. Typical Bluetooth devices
are PDAs or wireless headsets which can communicate with a PC or laptop system. Due to the
relatively small area covered by IEEE 802.15, it is possible to save a considerable amount of energy
compared to the IEEE 802.11 family of standards.
2.5 Port Access Control (IEEE 802.1X)
Port-based network access control as defined in IEEE 802.1X makes use of the physical access
characteristics of IEEE 802 LAN infrastructures in order to provide a means of authenticating and
authorizing devices attached to a LAN port that has point-to-point connection characteristics, and
of preventing access to that port in cases in which the authentication and authorization process fails.
A port in this context is a single point of attachment to the LAN infrastructure. Examples of ports in
which the use of authentication can be desirable include the ports of MAC Bridges (as specified in
IEEE 802.1D), the ports used to attach servers or routers to the LAN infrastructure, and associations
between stations and access points in IEEE 802.11 wireless LANs.
2.6 Bridges
Multiple IEEE 802 LAN segments can be interconnected by using so-called bridges. By using
bridges, it does not really matter which IEEE 802 technology is used in the segments that are to be
connected. Examples are big Ethernet LANs that consist of multiple Ethernet segments and also
include Wireless LAN segments.
[Figure: the bridges B1, B2 and B3 interconnect 802.11, 10Base5, 10Base2, 100BaseT and 802.5
segments.]
Figure 2.6: Bridges are used to interconnect different LAN segments
Bridges (sometimes also called layer two switches) have a number of advantages:
1. Different IEEE 802 LAN technologies (e.g., Ethernet, Token Ring, WLAN) can be interconnected.
2. Geographically dispersed LAN segments can be connected by using different media in the
backbone segments (e.g., fiber) and the access segments (e.g., twisted pair).
3. Highly loaded LAN segments can be split into smaller segments which improves their performance.
4. Bridges can improve the robustness of the network since errors are better localized (due to
smaller segments) and since bridges offer the possibility to have multiple redundant paths in
the network.
5. Bridges can improve to some extent the security of the network since traffic can be better
restricted to the shorter local LAN segments.
Bridges operate on the IEEE 802 LLC layer as shown in Figure 2.7 and this is the reason why
different IEEE 802 technologies can be crossed via a bridge.
[Figure: protocol stacks. One end system runs Network over IEEE 802.2 LLC over IEEE 802.3 MAC and
PHY, the other runs Network over IEEE 802.2 LLC over IEEE 802.11 MAC and PHY. The bridge between
them runs IEEE 802.2 LLC on top of both an IEEE 802.3 MAC/PHY and an IEEE 802.11 MAC/PHY.]
Figure 2.7: IEEE 802 bridge connecting an IEEE 802.3 and an IEEE 802.11 segment
Although bridges are conceptually simple, there are some issues one has to pay attention to:
• Different LAN segments usually operate at different speeds in terms of bits per second. A
bridge connecting such segments should have some buffering capacity to handle traffic bursts
or peaks (but of course, every buffer has a limited size).
• Different LAN segments may have different maximum frame sizes. A bridge receiving a
frame which exceeds the maximum frame size of the destination LAN segment can only drop
that frame.
• Different LAN segments which operate at different speeds may confuse timers at higher
protocol levels that are not aware of the bridging situation.
• Some LAN technologies support priorities while others do not.
• Some LAN technologies are real-time capable while others are not.
• Some LAN technologies signal the delivery of a frame to the sender while others do not.
There are two basic types of bridges: source routing bridges and transparent bridges. Both of them
are discussed in the next sections.
2.6.1 Source Routing Bridges
Source routing bridges assume that a sending station can distinguish between stations attached
to the local LAN segment and stations that are attached to remote LAN segments. If a frame has
to be sent to a station connected to a remote LAN segment, the sender first has to determine the
path to the remote LAN segment before sending the frame along this path. The path to follow is
actually encoded and sent along with the frame. A special protocol is used by the stations to locate
destination stations and to find suitable routes.
The advantage of this approach is that one can make efficient use of the available bandwidth by
utilizing redundant paths to the receiving station. The price is, however, increased complexity in
the end systems that participate in a source routing bridged network.
2.6.2 Transparent Bridges (IEEE 802.1D)
Transparent bridges (sometimes also called spanning tree bridges) do not need special software
on the stations nor do they need a manual configuration. Instead, they adapt to their environment
automatically and are thus fully transparent from the view of the network user or (to some extent)
the network operator. The price for this is that not all available bandwidth in a bridged network can
be used to its full potential.
LAN segments are connected to transparent bridges through so-called ports. The simplest of all
transparent bridges has two ports. Today, it is not unusual to have bridges which have hundreds
of ports that are realized on multiple modules interconnected by a high-speed backplane network.
Many of the commercial products can be stacked so that a bridge can grow in the number of ports
and the number of IEEE 802 technologies supported on the ports.
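[Figure: block diagram. Two ports (Port 1 and Port 2), each with a MAC chipset, are connected to
the bridge protocol entity, the port management software, the memory buffers and the forwarding
database, which maps station addresses to port numbers.]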
Figure 2.8: Internal structure of transparent bridges
Bridges can receive frames on multiple ports simultaneously. It is therefore necessary to have some
buffer space to hold incoming frames. The ports of a transparent bridge generally work in
promiscuous mode, which allows them to receive all frames on the segment and not only the frames
that are destined to the bridge.
A transparent bridge internally maintains a forwarding database which maps received destination
MAC addresses to outgoing port numbers.
• When a frame has been received by a transparent bridge, the forwarding database is checked
to find an entry which matches the destination address contained in the received frame.
• If a matching entry has been found and if the port number associated with the MAC address is
not equal to the port number from which the frame was received, then the frame is forwarded
to the port indicated by the forwarding database entry. The frame is discarded if the port
number of the forwarding database entry is identical to the port number from which the frame
was received.
• If no matching entry can be found in the forwarding database, then the frame is forwarded to
all ports except the port from which the frame was received (flooding).
• Many bridges also support a feature which allows network operators to configure that the
forwarding function is disabled for certain MAC addresses.
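The forwarding rules listed above can be summarized in a short Python sketch. The forwarding
database is modeled here as a plain dictionary from MAC address to port number; this representation
and the function name are assumptions of this illustration, not part of the standard.

```python
def forward(fdb, dst, in_port, all_ports):
    """Return the set of ports on which a received frame is sent out."""
    out_port = fdb.get(dst)
    if out_port is None:
        return set(all_ports) - {in_port}   # unknown destination: flooding
    if out_port == in_port:
        return set()                         # destination on the same segment: discard
    return {out_port}

fdb = {"00:00:00:00:00:0a": 1, "00:00:00:00:00:0b": 2}
ports = [1, 2, 3]
print(forward(fdb, "00:00:00:00:00:0b", 1, ports))  # forwarded to port 2
print(forward(fdb, "00:00:00:00:00:0a", 1, ports))  # discarded
print(forward(fdb, "00:00:00:00:00:0c", 1, ports))  # flooded to ports 2 and 3
```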
Backward Learning
The forwarding database must be populated and it must adapt to changes of the network topology
dynamically. One usually solves this problem by learning the current configuration from the frames
received by the bridge. Learned entries in the forwarding database have a timer attached to expire
these entries in case no other matching packets have been received:
• The forwarding database is initialized to be empty when a bridge boots or is reinitialized.
• When a bridge receives a frame whose source address does not yet exist in the forwarding
database, then it extracts the source address and determines the port number from which the frame
was received. The source address and the port number are then stored in the forwarding database.
The frame is then forwarded to all other ports (which also propagates information to other
bridges).
• Every entry in the forwarding database has a timer attached to it. Entries are automatically
discarded if they have not been confirmed by additional received frames within a certain time
interval (soft state).
• The aging of unused entries reduces the size of the forwarding table and allows bridges to
react to topology changes dynamically (after a short delay).
• The backward learning algorithm only works if the topology is a strict tree and does not
contain cycles. In case of multiple paths between LAN segments, it is possible that entries in
the forwarding database are overwritten periodically. The network does not behave in a stable
way in such a situation.
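Backward learning with aging of entries can be sketched as follows. The class below is a minimal
illustration, not a description of a real bridge implementation: the class and method names are
made up, the current time is passed in explicitly to keep the example deterministic, and the
default aging time of 300 seconds is an assumed value.

```python
class ForwardingDatabase:
    def __init__(self, max_age=300.0):
        self.max_age = max_age   # aging time in seconds (assumed value)
        self.entries = {}        # MAC address -> (port, time last seen)

    def learn(self, src, in_port, now):
        """Create or refresh the entry for the source address of a frame."""
        self.entries[src] = (in_port, now)

    def lookup(self, dst, now):
        """Return the port for dst, or None if it is unknown or expired."""
        entry = self.entries.get(dst)
        if entry is None:
            return None
        port, seen = entry
        if now - seen > self.max_age:
            del self.entries[dst]   # soft state: drop the aged entry
            return None
        return port

db = ForwardingDatabase(max_age=10)
db.learn("00:00:00:00:00:0a", in_port=1, now=0)
print(db.lookup("00:00:00:00:00:0a", now=5))    # entry still valid
print(db.lookup("00:00:00:00:00:0a", now=11))   # entry has expired
```

Because every received frame refreshes the entry of its source address, a station that moves to
another port is found again as soon as it sends a frame or its old entry has aged out.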
Spanning Trees
Bridged networks which do not have a loop-free tree structure cause problems since frames might
travel endlessly in a loop (ping-pong) when using backward learning alone. Transparent bridges
therefore construct a spanning tree in these cases which is used to restrict how frames are forwarded.
The spanning tree protocol requires a unique identification of the bridges involved. The so called
bridge identifier consists of one of the MAC addresses (six bytes) of a bridge plus a priority value
(two bytes). The priority value can be set administratively to influence the spanning trees computed
with the spanning tree protocol.
The spanning tree protocol executes in the following steps:
1. In the first step, the root of the spanning tree is selected (root bridge). The root bridge is the
bridge with the highest priority and, among bridges of equal priority, the smallest bridge address.
The root of the spanning tree is periodically broadcast and will be recomputed as needed.
2. In the second step, the costs for all possible paths from the root bridge to the various ports on
the bridges are computed (root path cost). Every bridge determines which local port is used to
reach the root bridge at the lowest cost. The selected port is called the root port.
3. In the third step, the designated bridge is determined for each segment. The designated bridge
of a segment is the bridge which connects the segment to the root bridge with the lowest costs
on its root port. At equal costs, the bridge with the lowest bridge identifier wins. The ports
through which the segments reach their designated bridges are called designated ports.
4. Finally, all ports which are neither root ports nor designated ports are blocked. The resulting
active topology is a spanning tree.
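The selection of the root bridge in the first step boils down to a comparison of bridge identifiers.
The following sketch assumes that the two byte priority value precedes the six byte MAC address in
the identifier and that a numerically smaller priority value means a higher priority, as is the
convention in IEEE 802.1D. The function names and example values are made up for this illustration.

```python
def bridge_id(priority, mac):
    """Eight byte bridge identifier: two byte priority followed by the MAC address."""
    return priority.to_bytes(2, "big") + bytes.fromhex(mac.replace(":", ""))

def elect_root(bridges):
    """Pick the bridge with the numerically smallest bridge identifier."""
    return min(bridges, key=lambda b: bridge_id(*b))

bridges = [
    (32768, "00:00:00:00:00:03"),
    (32768, "00:00:00:00:00:01"),
    (4096, "00:00:00:00:00:09"),
]
print(elect_root(bridges))
```

In this example the third bridge becomes the root although it has the largest MAC address, which
shows how the priority value can be used administratively to influence the computed spanning tree.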
The spanning tree protocol uses so called BPDUs to distribute information. A BPDU has the structure shown in Figure 2.9.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Protocol Identifier      |    Version    |   BPDU Type   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Flags     |                                               =
+-+-+-+-+-+-+-+-+                    Root ID                    +
=                                                               =
+               +-----------------------------------------------+
=               |                Root Path Costs                =
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
=               |                                               =
+-+-+-+-+-+-+-+-+                   Bridge ID                   +
=                                                               =
+               +-----------------------------------------------+
=               |            Port ID            |    Message    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Age      |          Maximum Age          |     Hello     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Timer     |         Forward Delay         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 2.9: Bridge PDU (BPDU) format
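The BPDU layout of Figure 2.9 maps onto a fixed-size binary structure. The sketch below encodes
and decodes the fields with Python's struct module in network byte order; the function names, the
default values and the example field values are arbitrary choices for this illustration and are
not taken from the standard.

```python
import struct

# protocol identifier, version, BPDU type, flags, root ID, root path costs,
# bridge ID, port ID, message age, maximum age, hello timer, forward delay
BPDU_FORMAT = "!HBBB8sI8sHHHHH"

def pack_bpdu(root_id, root_path_cost, bridge_id, port_id,
              message_age, max_age, hello, forward_delay,
              protocol_id=0, version=0, bpdu_type=0, flags=0):
    """Encode the BPDU fields in the order shown in Figure 2.9."""
    return struct.pack(BPDU_FORMAT, protocol_id, version, bpdu_type, flags,
                       root_id, root_path_cost, bridge_id, port_id,
                       message_age, max_age, hello, forward_delay)

def unpack_bpdu(data):
    """Decode a BPDU into a tuple of its twelve fields."""
    return struct.unpack(BPDU_FORMAT, data)

root = bytes.fromhex("8000000000000001")   # example: priority + MAC address
me = bytes.fromhex("8000000000000002")
bpdu = pack_bpdu(root, 19, me, 0x8001, 256, 5120, 512, 3840)
print(len(bpdu), unpack_bpdu(bpdu)[4:7])
```

The encoded BPDU is 35 bytes long, which matches the eight full rows and the final three byte row
of the figure.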
2.7 Virtual LANs (IEEE 802.1Q)
Virtual LANs (virtual bridged LANs, VLANs) emulate a virtual LAN segment on top of a complex
IEEE 802 bridged network.
[Figure: a bridged network with the bridges B0, B1 and B2.]
Figure 2.10: Virtual LANs
VLANs make it possible to separate the traffic on an IEEE 802 network, which has several advantages:
• A station connected to a certain VLAN only sees frames that belong to the VLAN.
• VLANs can reduce the network load. In particular, frames that are targeted to all stations
(broadcasts) will only be delivered to the stations connected to the VLAN.
• It is possible that a station is a member of multiple VLANs simultaneously. This makes it
possible, for example, to use a central server from multiple VLANs.
• By assigning stations to VLANs, it is possible to create logical LAN topologies that are
independent of the underlying physical LAN topology.
A VLAN is identified by a VLAN identifier (1..4094) and realized by VLAN supporting bridges.
The assignment of bridge ports to VLANs can be done in different ways:
• Port based VLANs: The ports of a bridge are assigned administratively to the various VLANs.
A single port can in general participate in multiple VLANs.
• MAC address based VLANs: The MAC addresses of the stations are assigned administratively to the various VLANs. With this scheme, it does not matter on which port a given
station connects to a bridge.
• Protocol based VLANs: Frames are assigned to VLANs by inspecting the payload contained
in the frames. This technique makes it possible to create VLANs for, e.g., Appletalk or IPX frames.
• Multi-cast group based VLANs: VLANs are defined for all members of a certain multi-cast
group. This requires a multi-cast group membership protocol to be effective.
On links that carry frames which belong to different VLANs, it is necessary to tag the frames with
the VLAN identifier. In the case of Ethernet frames, a new field called the tag header is introduced
right after the destination and source addresses, as shown in Figure 2.11.
field                               size
preamble                            7 Byte
start-of-frame delimiter (SFD)      1 Byte
destination MAC address             6 Byte
source MAC address                  6 Byte
tag protocol identifier             2 Byte
priority, CFI, vlan identifier      2 Byte
length / type field                 2 Byte
data (network layer packet)         46-1500 Byte (together with the padding)
padding (if required)
frame check sequence (FCS)          4 Byte
The fields from the destination MAC address to the FCS add up to 64-1522 Byte.
Figure 2.11: IEEE 802.3 tagged frame format
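The construction of the tag header can be sketched in Python as follows. The sketch assumes the
usual split of the second two bytes into a 3-bit priority, a 1-bit CFI and a 12-bit VLAN identifier
and uses the tag protocol identifier value 0x8100 of IEEE 802.1Q. The function names are made up,
and the FCS, which has to be recomputed after tagging, is left out.

```python
import struct

TPID_8021Q = 0x8100   # tag protocol identifier value for IEEE 802.1Q tags

def make_tag(vlan_id, priority=0, cfi=0):
    """Build the four byte tag header from priority, CFI and VLAN identifier."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN identifier must be in 1..4094")
    if not 0 <= priority <= 7:
        raise ValueError("priority must fit into 3 bits")
    tci = (priority << 13) | ((cfi & 1) << 12) | vlan_id
    return struct.pack("!HH", TPID_8021Q, tci)

def tag_frame(frame, vlan_id, priority=0):
    """Insert the tag header right after the destination and source addresses."""
    return frame[:12] + make_tag(vlan_id, priority) + frame[12:]

print(make_tag(100, priority=5).hex())
print(len(tag_frame(bytes(60), vlan_id=1)))
```

Tagging makes every frame four bytes longer, which is the reason for the extended maximum frame
length discussed below.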
The introduction of VLAN tags has some implications:
• Tagged frames can exceed the maximum frame lengths accepted by stations which do not
support VLANs.
• The IEEE 802.1Q standard generally requires that frames which exceed the maximum allowed length are discarded.
• In the case of IEEE 802.3 frames, an extension of the original frame of four bytes has been
granted (which changes the maximal length of a frame from 1518 bytes to 1522 bytes).
The IEEE 802.1Q standard also introduces the Generic Attribute Registration Protocol (GARP),
which can among other things propagate information about VLAN membership of individual ports.
This information can be used by VLAN enabled devices to suppress frames for VLANs which
currently have no members.
2.8 LAN Priorities (IEEE 802.1D)
The 1998 revision of the IEEE 802.1D standard introduces additional support for priorities and
quality of service. (The additions were developed under the name IEEE 802.1p and are still
referred to by this name.) The original IEEE 802.3 and IEEE 802.11 frame formats do not allow
priorities to be communicated. When using IEEE 802.1Q VLAN frames, priorities can be encoded in
the 3-bit priority field of the four byte VLAN tag.
The core idea behind 802.1D priority extensions is to support bridges that have multiple output
queues for each port. Within a bridge, frames are assigned to certain traffic classes based on the user
priority (usually carried with the frame) and the access priority (which is the priority associated with
the media access mechanism). The traffic class of a frame is then used to select the queue where the
frame is queued for transmission. Note that a bridge must preserve the ordering of unicast frames
with a given combination of source and destination addresses and the order of multicast frames with
a given destination address.
Chapter 3
Internet Network Layer
The Internet Protocol(s) developed and standardized by the IETF currently dominate the network
layer in data communication networks and they reach out into voice and multi-media communication networks. The widely deployed version of the Internet Protocol (IP) is version 4 (IPv4) while
version 6 (IPv6) is currently gaining more deployment and thus practical relevance. This chapter
is centered around these two protocols. It also discusses the protocols which support the network
layer such as routing protocols or protocols which aim to automate the end system configuration
process.
3.1 Fundamentals
First, we consider some fundamentals which are important to understand the design of the Internet
protocols.
3.1.1 Evolution of the Internet
In the mid 1970s, the Defense Advanced Research Projects Agency (DARPA) of the USA started
projects to develop inter-networking technologies. These projects led to the ARPANET, a packet
switched network running on top of leased lines. The ARPANET later became a backbone network
connecting all the major universities in the USA.
An implementation of the Internet protocols became a part of the BSD Unix operating system in the
early 1980s. The BSD Unix system became very popular in research organizations, which led to a wide
deployment of the Internet protocols in research environments. The integration of the Internet
protocols into BSD Unix also led to the development of the so-called socket application programming
interface (API) which became the de facto standard operating system-level API to write networked
applications.
In 1983, the ARPANET was split into the ARPANET research network and the MILNET for use by
the US military. The ARPANET research network became the NSFNET in 1986, funded by the
National Science Foundation of the USA. In 1990, the NSFNET backbone turned into
the ANSNET operated jointly by MERIT, MCI and IBM. In the early 1990s, the World Wide Web
was born at CERN in Switzerland. More details about the evolution of the Internet can be found at
the Web page of the Internet Society (see footnote 1).
Footnote 1: http://www.isoc.org/
3.1.2 Internet Design Principles
There are a number of fundamental principles which were followed during the development of
Internet protocols. Some of the principles described in RFC 1958 [15] are:
• The first principle is that connectivity is its own reward. The idea here is that connectivity
across different link and transmission technologies is more valuable than any individual application such as email or the World-Wide Web. The technique to realize connectivity is to
realize an inter-networking layer which puts only very basic requirements on the underlying
link and transmission technologies.
• All functions which require knowledge of the state of end-to-end communication should be
realized at the endpoints and not inside of the network (end-to-end argument). In other words,
end-to-end protocol design should not rely on the maintenance of state (i.e. information
about the state of the end-to-end communication) inside the network. Such state should be
maintained only in the endpoints, in such a way that the state can only be destroyed when the
endpoint itself breaks (known as fate-sharing).
Of course, to perform its services, the network maintains some state information such as
routes and the like. This state must be self-healing; adaptive procedures or protocols must
exist to derive and maintain that state, and change it when the topology or activity of the
network changes. The volume of this state must be minimized, and the loss of the state must
not result in more than a temporary denial of service given that connectivity exists. Manually
configured state must be kept to an absolute minimum.
• There is no central instance which controls the Internet and which is able to turn it off.
• Addresses should uniquely identify endpoints. Dynamic changes within the network should
be possible without having to change the identification of the end-systems.
• Intermediate systems should be stateless wherever possible. If state is necessary, it should be
attached to a timer and not static (soft state).
• To increase interoperability, implementations should be liberal in what they accept and stringent in what they generate. Interoperability is more important than strict correctness.
• Keep it simple. When in doubt during design, choose the simplest solution.
It is also important to consider that protocols sometimes show effects when used on a larger scale
that cannot be observed on small scales. These effects often stem from interactions between layers or features. One approach to address these issues is to keep complexity to a minimum by following
the simplicity principle discussed further in RFC 3439 [16].
3.1.3
Basic Terminology
It is necessary to introduce some terminology. These lecture notes use the terminology as defined
in RFC 2460 [17]. Some older books and documents do not necessarily use the same terminology
and it is thus sometimes necessary to mentally map terms when reading other documents.
• A node is a device which implements an Internet Protocol (such as IPv4 or IPv6).
• A router is a node that forwards IP packets not addressed to itself.
• A host is any node which is not a router.
• A link is a communication channel below the IP layer which allows nodes to communicate
with each other (e.g., an Ethernet).
• The neighbors of a node are all nodes attached to the same link.
• An interface is a node’s attachment to a link.
• An IP address identifies an interface or a set of interfaces.
• An IP packet is a bit sequence consisting of an IP header and the payload.
• The link MTU is the maximum transmission unit, i.e., maximum packet size in octets, that
can be conveyed over a link.
• The path MTU is the minimum link MTU of all the links in a path between a source node
and a destination node.
3.1.4
Autonomous Systems
The global Internet consists of a set of so called autonomous systems which are inter-connected. An
autonomous system (AS) is basically a set of routers and networks under the same administration.
• An autonomous system is identified by a number, the so-called AS number. The number
space is currently restricted to 16 bits, which is becoming problematic.
• IP packets are forwarded between autonomous systems over paths that are established by an
Exterior Gateway Protocol. The internal structure of an autonomous system is irrelevant for
the protocol establishing paths between autonomous systems.
• Within an autonomous system, IP packets are forwarded over paths that are established by an
Interior Gateway Protocol.
The introduction of autonomous systems and the distinction between interior and exterior routing
protocols implies a two-level Internet routing architecture.
Autonomous systems can be classified as follows:
• A stub AS only has a single connection to another AS.
• A multihomed AS has multiple connections to other ASes but does not forward transit traffic.
• A transit AS has multiple connections to other ASes and carries local as well as transit traffic.
3.1.5
Internet Address Scopes
Internet addresses do not all have the same scope of uniqueness. While most IP addresses have
global scope, some addresses are only guaranteed to be unique on a certain interface while others
are only guaranteed to be unique on a certain link.
The scope of an Internet address is a topological span within which the address may be used as
a unique identifier for an interface or a set of interfaces. A scope zone, or simply a zone, is a
concrete connected region of topology of a given scope. Note that a zone is a particular instance of
a topological region, whereas a scope is the size of a topological region.
 ---------------------------------------------------------------
| a node                                                        |
|                                                               |
|                                                               |
|                                                               |
|                                                               |
|                                                               |
|  /--link1--\ /--------link2--------\ /--link3--\ /--link4--\  |
|                                                               |
|  /--intf1--\ /--intf2--\ /--intf3--\ /--intf4--\ /--intf5--\  |
 ---------------------------------------------------------------
        :           |           |           |           |
        :           |           |           |           |
        :           |           |           |           |
    (imaginary    =================     a point-        a
     loopback        an Ethernet        to-point     tunnel
      link)                               link
Since Internet addresses on devices that connect multiple zones are not necessarily unique, an additional zone index is needed on these devices to select an interface or a set of interfaces.
3.2
Internet Protocol Version 4 (IPv4)
The Internet Protocol version 4 (IPv4) was standardized in 1981 and is documented in RFC 791
[18]. The IPv4 protocol is the basis of today’s global Internet. The original IPv4 specification has
been adapted to emerging requirements during the last 20 years. The following description of IPv4
reflects the current interpretation of IPv4.
3.2.1
IPv4 Addresses
The principal structure and the textual representation of IPv4 addresses have already been introduced
in chapter 1.
• For forwarding purposes, IPv4 addresses are divided into a part which identifies a network
(netid) and a part which identifies an interface of a node within that network (hostid).
• The number of leading bits of an IPv4 address which identify the network is called the address
prefix length. The prefix length is commonly written as a decimal number, appended to the usual
IPv4 address notation by using a slash (/) as a separator (e.g., 192.0.2.0/24).
• Older documents use a so called netmask, a bit field of the size of an IPv4 address
which yields the network identifier when it is combined with an IPv4
address by a bitwise logical AND operation (e.g., 192.0.2.0 & 255.255.255.0).
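The relation between the prefix notation and the older netmask notation can be illustrated with a minimal sketch based on the ipaddress module of the Python standard library. The interface address 192.0.2.77/24 is an arbitrary example taken from the documentation block and is not part of the original notes.

```python
import ipaddress

# An interface address written with its prefix length (example value).
iface = ipaddress.ip_interface("192.0.2.77/24")

prefix_len = iface.network.prefixlen   # number of bits identifying the network
netmask = iface.network.netmask        # the equivalent netmask

# Netmask style: a bitwise AND of address and netmask yields the netid,
# the remaining bits form the hostid.
netid = int(iface.ip) & int(netmask)
hostid = int(iface.ip) & ~int(netmask) & 0xFFFFFFFF

print(prefix_len, netmask)
print(ipaddress.ip_address(netid), hostid)
```

Both notations carry the same information: a prefix length of 24 corresponds to a netmask whose 24 leading bits are set.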
Not all possible IPv4 addresses can be used in the global Internet without restrictions. Some addresses are reserved or have special semantics attached to them, as described in RFC 3330 [19]:
Address Block       Present Use                                        Reference
0.0.0.0/8           "This" Network                                     [RFC1700]
10.0.0.0/8          Private-Use Networks                               [RFC1918]
14.0.0.0/8          Public-Data Networks                               [RFC1700]
24.0.0.0/8          Cable Television Networks                          [RFC3330]
39.0.0.0/8          Class A Subnet Experiment                          [RFC1797]
127.0.0.0/8         Loopback                                           [RFC1700]
128.0.0.0/16        Reserved by IANA                                   [RFC3330]
169.254.0.0/16      Link Local                                         [RFC3330]
172.16.0.0/12       Private-Use Networks                               [RFC1918]
191.255.0.0/16      Reserved by IANA                                   [RFC3330]
192.0.0.0/24        Reserved by IANA                                   [RFC3330]
192.0.2.0/24        Test-Net / Documentation                           [RFC3330]
192.88.99.0/24      6to4 Relay Anycast                                 [RFC3068]
192.168.0.0/16      Private-Use Networks                               [RFC1918]
198.18.0.0/15       Network Interconnect / Device Benchmark Testing    [RFC2544]
223.255.255.0/24    Reserved by IANA                                   [RFC3330]
224.0.0.0/4         Multicast                                          [RFC3171]
240.0.0.0/4         Reserved for Future Use                            [RFC1700]
• Addresses for private networks, which are not routed through the public global Internet, can be
taken from the address blocks 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16 as specified in RFC 1918 [20].
• Test addresses or addresses that are used solely for documentation purposes can be taken
from the address block 192.0.2.0/24.
• Addresses from the address block 0.0.0.0/8 identify a sender which is not yet fully configured
(typically 0.0.0.0).
• The address block 127.0.0.0/8 identifies the local node; it is also called the loopback network.
• The special address 255.255.255.255 causes a local broadcast.
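As an illustration, the following sketch checks whether an address falls into one of the special-purpose blocks. The list SPECIAL_BLOCKS contains only a few of the blocks listed above, and the function name classify is a hypothetical helper chosen for this example.

```python
import ipaddress

# A few of the special-purpose blocks listed above (not the complete list).
SPECIAL_BLOCKS = [
    ("0.0.0.0/8", '"This" Network'),
    ("10.0.0.0/8", "Private-Use Networks"),
    ("127.0.0.0/8", "Loopback"),
    ("169.254.0.0/16", "Link Local"),
    ("172.16.0.0/12", "Private-Use Networks"),
    ("192.0.2.0/24", "Test-Net / Documentation"),
    ("192.168.0.0/16", "Private-Use Networks"),
    ("224.0.0.0/4", "Multicast"),
    ("240.0.0.0/4", "Reserved for Future Use"),
]


def classify(address):
    """Return the use of the first listed block containing address, else None."""
    addr = ipaddress.ip_address(address)
    for block, use in SPECIAL_BLOCKS:
        if addr in ipaddress.ip_network(block):
            return use
    return None


for candidate in ("10.1.2.3", "172.20.0.1", "127.0.0.1", "134.169.34.1"):
    print(candidate, classify(candidate))
```

An address that matches none of the listed blocks, such as 134.169.34.1, yields None in this sketch.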
3.2.2
IPv4 Packet Format
IPv4 packets have the following structure as specified in RFC 791 [18]:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live |    Protocol   |         Header Checksum       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Source Address                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Destination Address                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Remarks:
• The Version field contains the version number (4).
• The length of the protocol header is stored in the Internet Header Length (IHL)
field. The length is counted in the number of 4 byte words. The minimum header length is 5
(which corresponds to 20 bytes) and the maximum header length is 15 (which corresponds to
60 bytes).
• The interpretation of the Type of Service (TOS) field has changed over time. The
current interpretation of this field uses six bits as the Differentiated Services Code Point
(DSCP) and two bits for explicit congestion notification (ECN), as specified in RFC 2474
[21], RFC 3168 [22] and RFC 3260 [23].
   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+-----+-----+-----+-----+
|          DS FIELD, DSCP           | ECN FIELD |
+-----+-----+-----+-----+-----+-----+-----+-----+
• The length of the IPv4 packet (including the protocol header) is stored in the Total Length
field. Since this is a 16-bit field, IPv4 packets can have a maximum length of 65535 bytes.
• The fields Identification, Flags and Fragment Offset support fragmentation of
IPv4 packets. The Identification field contains the same value for all fragments of an
IPv4 packet. The Fragment Offset field contains the relative position of a fragment of
an IPv4 packet (counted in 64-bit words). The flag More Fragments (MF) is set if more
fragments follow. The flag Don’t Fragment (DF) can be set to indicate that the sender
does not want fragmentation, in which case an IPv4 packet will be discarded if it does not fit
into the maximum frame size of the outgoing link of a node. Note that IPv4 allows fragments
to be further fragmented without intermediate reassembly.
• The Time to Live (TTL) field is used to limit the lifetime of an IPv4 packet. The
lifetime is usually measured in the number of hops passed rather than a period in time. Every
router forwarding an IPv4 packet decrements this field and the packet is discarded once the
value of the field becomes zero.
• The Protocol field identifies the protocol contained in the IPv4 packet, in most cases one
of the Internet transport protocols.
• The Header Checksum field contains the Internet checksum computed over the header.
• The Source Address and Destination Address fields contain the source and destination addresses of the packet.
• There are a number of options which can be used to control the forwarding of a packet or
which cause routers to append forwarding information to the protocol header. Most of these
options are practically irrelevant since the remaining 40 bytes are usually not enough for
using these options.
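The field layout described in the remarks above can be made concrete with a minimal sketch that decodes the fixed 20-byte part of an IPv4 header and verifies the header checksum. The function names, the dictionary keys and the sample header (addresses from the documentation block 192.0.2.0/24) are assumptions made for this illustration, and options are ignored.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """One's complement of the one's complement sum of all 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def parse_ipv4_header(packet: bytes) -> dict:
    """Decode the fixed part of an IPv4 header (options are not decoded)."""
    (ver_ihl, tos, total_length, ident, flags_frag,
     ttl, proto, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    ihl = ver_ihl & 0x0F                     # header length in 4-byte words
    return {
        "version": ver_ihl >> 4,
        "header_bytes": ihl * 4,
        "dscp": tos >> 2,                    # upper six bits of the TOS octet
        "ecn": tos & 0x03,                   # lower two bits of the TOS octet
        "total_length": total_length,
        "identification": ident,
        "df": bool(flags_frag & 0x4000),
        "mf": bool(flags_frag & 0x2000),
        "fragment_offset_bytes": (flags_frag & 0x1FFF) * 8,  # 64-bit units
        "ttl": ttl,
        "protocol": proto,
        # A valid header sums to zero when the checksum field is included.
        "checksum_ok": internet_checksum(packet[:ihl * 4]) == 0,
        "src": ".".join(map(str, src)),
        "dst": ".".join(map(str, dst)),
    }


# Build a sample header: version 4, IHL 5, DF set, TTL 64, protocol 6.
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0x4000, 64, 6, 0,
                     bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]))
sample = sample[:10] + struct.pack("!H", internet_checksum(sample)) + sample[12:]
fields = parse_ipv4_header(sample)
print(fields)
```

The checksum is computed with the checksum field set to zero and inserted afterwards, which is why a receiver obtains zero when summing over the complete header.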
3.2.3
IPv4 Forwarding
Every node maintains a forwarding table (also sometimes called the forwarding information base)
which is used to direct IPv4 packets closer to their destination [24].
• The forwarding table realizes a mapping of the network prefix to the next node (next hop)
and the local interface used to reach the next node.
• For every IP packet, the forwarding table entry with the longest matching
network address prefix has to be found (longest prefix match).
The following example shows on a simple network topology the contents of the various forwarding
tables involved:
Network 134.169.0.0/16:  R1 (134.169.2.1), H1 (134.169.9.10), R2 (134.169.246.34)
Network 134.169.34.0/24: R2 (134.169.34.12), H2 (134.169.34.1)

Forwarding table of H1:
Prefix             Next Hop          Interface
0.0.0.0/0          134.169.2.1       eth0
134.169.34.0/24    134.169.246.34    eth0
134.169.0.0/16     134.169.9.10      eth0

Forwarding table of R2:
Prefix             Next Hop          Interface
0.0.0.0/0          134.169.2.1       eth0
134.169.0.0/16     134.169.246.34    eth0
134.169.34.0/24    134.169.34.12     eth1

Forwarding table of H2:
Prefix             Next Hop          Interface
0.0.0.0/0          134.169.34.12     eth0
134.169.34.0/24    134.169.34.1      eth0
Figure 3.1: Example for IPv4 forwarding
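A longest prefix match over such a table can be sketched as a simple linear scan. The entries below are the three-entry table with the eth1 interface from the figure, which is assumed here to belong to router R2; the function name lookup is a hypothetical helper. A linear scan is only meant to illustrate the matching rule and is not how large tables are searched.

```python
import ipaddress

# (prefix, next hop, interface); assumed to be the table of router R2.
TABLE = [
    ("0.0.0.0/0", "134.169.2.1", "eth0"),
    ("134.169.0.0/16", "134.169.246.34", "eth0"),
    ("134.169.34.0/24", "134.169.34.12", "eth1"),
]


def lookup(table, destination):
    """Return (next hop, interface) of the longest matching prefix, or None."""
    dst = ipaddress.ip_address(destination)
    best_len, best = -1, None
    for prefix, next_hop, interface in table:
        net = ipaddress.ip_network(prefix)
        if dst in net and net.prefixlen > best_len:
            best_len, best = net.prefixlen, (next_hop, interface)
    return best


for destination in ("134.169.34.1", "134.169.9.10", "192.0.2.1"):
    print(destination, lookup(TABLE, destination))
```

The destination 134.169.34.1 matches all three prefixes, and the /24 entry wins because it is the most specific one; an address outside 134.169.0.0/16 only matches the default route 0.0.0.0/0.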
Variations and extensions of the basic forwarding model:
• A node has multiple forwarding tables. Information contained in fields of the incoming IP
packet (e.g., the DSCP value) is used to select one out of many forwarding tables to forward
the packet.
• Instead of maintaining the forwarding table(s) in a central place of a router, it is possible to
store at least frequently used parts on the router interfaces. This approach allows forwarding
lookups to be parallelized at the cost of more complex updates if routing tables change.
• Another approach to increase performance is to use caches for frequently used destination
addresses.
The performance of IP address lookups is crucial for high-speed IP routers:
• Forwarding tables can become very large (around 100000 entries on backbone routers have
been reported in January 2001).
• One technique used to reduce the size of forwarding tables is called address aggregation. If
a router has multiple forwarding table entries with a common prefix which point to the same
interface, the router can aggregate these entries into a single entry with a shorter prefix length.
Exceptions can still be handled by having some entries with a longer prefix.
• Due to the growth in the number of packets per second a router has to handle and the growth of
the forwarding tables, it is crucial to design lookup algorithms that scale well in the number
of addresses stored in a forwarding table. Note that routing updates occur frequently in
backbone routers and thus update operations must be reasonably fast as well.
• Large forwarding tables are usually represented as tries so that the complexity of lookup
operations depends on the distribution of the length of network prefixes and not on the total
number of table entries [25]. A trie is a tree-based data structure allowing the organization of
prefixes on a digital basis by using the bits of prefixes to direct the branching.
• The usage of optimized tree representations, usually implemented in hardware, provides the
performance that is needed to handle IP on very high speed links. See [26] for a good survey
on fast IP address lookup algorithms.
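The trie idea can be sketched with a plain binary trie in which every bit of a prefix selects one of two children. This is a minimal, unoptimized illustration; the class names and the sample entries are assumptions, and production routers use compressed or multi-bit variants, often in hardware.

```python
import ipaddress


class TrieNode:
    """One node per prefix bit; entry is set where a stored prefix ends."""

    def __init__(self):
        self.children = [None, None]
        self.entry = None


class PrefixTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix, entry):
        net = ipaddress.ip_network(prefix)
        bits = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):       # walk only the prefix bits
            bit = (bits >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
        node.entry = entry

    def lookup(self, address):
        bits = int(ipaddress.ip_address(address))
        node = self.root
        best = node.entry                    # default route, if stored
        for i in range(32):
            node = node.children[(bits >> (31 - i)) & 1]
            if node is None:
                break
            if node.entry is not None:       # remember the longest match so far
                best = node.entry
        return best


trie = PrefixTrie()
trie.insert("0.0.0.0/0", "default")
trie.insert("134.169.0.0/16", "campus")
trie.insert("134.169.34.0/24", "subnet")
print(trie.lookup("134.169.34.1"), trie.lookup("134.169.9.10"), trie.lookup("192.0.2.1"))
```

A lookup visits at most 32 nodes regardless of how many prefixes are stored, which reflects the property stated above that the cost depends on prefix lengths rather than on the number of table entries.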
3.2.4
IPv4 Error Handling (ICMPv4)
The Internet Control Message Protocol (ICMP) as specified in RFC 792 [27] is used to inform
nodes about problems encountered while forwarding IP packets. It also introduces messages which
can be used to perform simple tests. ICMP messages are transported in the payload of ordinary IP
packets.
In the following, a selection of ICMP message formats will be discussed. ICMP messages in general
contain a checksum which is computed over the ICMP message in order to detect some bit errors
in ICMP messages.
Echo Request/Reply
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Identifier          |        Sequence Number        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Data ...
+-+-+-+-+-
• The ICMP echo request message (type = 8, code = 0) asks the destination node to return an
echo reply message (type = 0, code = 0) to the sender of the echo request message.
• The Identifier and Sequence Number fields are used by the sender to correlate incoming replies with previously sent requests.
• The data field may contain additional data or just fill bytes in order to bring the IP packets to
a certain size.
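The echo message format can be illustrated by a small sketch that assembles echo messages and correlates a reply with a request. The helper names build_echo and matches are hypothetical, the identifier, sequence number and payload are arbitrary example values, and nothing is sent on the network.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """One's complement of the one's complement sum of all 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo(msg_type, identifier, sequence, payload=b""):
    """Assemble an echo request (type 8) or echo reply (type 0), code 0."""
    unchecked = struct.pack("!BBHHH", msg_type, 0, 0, identifier, sequence) + payload
    checksum = internet_checksum(unchecked)   # computed over the whole message
    return struct.pack("!BBHHH", msg_type, 0, checksum, identifier, sequence) + payload


def matches(reply, request):
    """Correlate a reply with a request via Identifier and Sequence Number."""
    r_type, r_code, _, r_id, r_seq = struct.unpack("!BBHHH", reply[:8])
    q_id, q_seq = struct.unpack("!HH", request[4:8])
    return (r_type, r_code, r_id, r_seq) == (0, 0, q_id, q_seq)


request = build_echo(8, 0x1234, 7, b"hello")
reply = build_echo(0, 0x1234, 7, b"hello")
print(request.hex(), matches(reply, request))
```

Sending such a message requires a raw socket and usually elevated privileges, which is outside the scope of this sketch.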
Unreachable Destinations
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             unused                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Internet Header + 64 bits of Original Data Datagram      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Type field has the value 3 for all unreachable destination messages.
• The Code field indicates why a certain destination is not reachable:
 0   Net Unreachable
 1   Host Unreachable
 2   Protocol Unreachable
 3   Port Unreachable
 4   Fragmentation Needed and Don’t Fragment was Set
 5   Source Route Failed
 6   Destination Network Unknown
 7   Destination Host Unknown
 8   Source Host Isolated
 9   Communication with Destination Network is Administratively Prohibited
10   Communication with Destination Host is Administratively Prohibited
11   Destination Network Unreachable for Type of Service
12   Destination Host Unreachable for Type of Service
13   Communication Administratively Prohibited
14   Host Precedence Violation
15   Precedence cutoff in effect
• The data field contains the beginning of the packet which caused the ICMP unreachable
destination message.
Redirect
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Router Internet Address                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Internet Header + 64 bits of Original Data Datagram      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Type field has the value 5 for all redirect messages.
• The Code field indicates which type of packets should be redirected:
 0   Redirect datagrams for the Network.
 1   Redirect datagrams for the Host.
 2   Redirect datagrams for the Type of Service and Network.
 3   Redirect datagrams for the Type of Service and Host.
• The Router Internet Address field contains the IP address of the router to which
packets should be redirected.
• The data field contains the beginning of the packet which caused the ICMP redirect message.
3.2.5
MTU Path Discovery
Fragmentation of IPv4 packets is problematic for several reasons [28]:
• The receiver must buffer fragments until all fragments have been received. However, it is not
useful to keep fragments in a buffer indefinitely. Hence, the TTL field of all buffered packets
will be decremented once per second and fragments are dropped when the TTL field becomes
zero.
• The loss of a fragment causes in most cases the sender to resend the original IP packet which
in most cases gets fragmented as well. Hence, the probability of transmitting a large IP packet
successfully goes quickly down if the loss rate of the network goes up.
• Since the Identification field identifies fragments that belong together and the number
space is limited, one cannot fragment an arbitrarily large number of packets.
An obvious solution for the problem is to cause the sender never to generate packets that are larger
than the path MTU and thus never have to be fragmented [29]. To make this simple solution work,
the sender has to be able to learn the path MTU:
• The sender sends IPv4 packets with the DF flag turned on.
• A router which has to fragment a packet with the DF flag turned on drops the packet and sends
an ICMP message back to the sender which also includes the local maximum link MTU.
• Upon receiving the ICMP message, the sender adapts its estimate of the path MTU and
retries.
• Since the path MTU can change dynamically (since the path can change), a once learned path
MTU should be verified and adjusted periodically.
• Not all routers necessarily send the local link MTU. In this case, the sender usually tries
typical MTU values, which is usually faster than doing a binary search.
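The discovery procedure described above can be simulated with a few lines of code. The sketch models a path as a list of link MTUs and assumes that every router reports its link MTU in the ICMP message; the function name and the MTU values are hypothetical and only serve as an example.

```python
def discover_path_mtu(link_mtus, first_guess):
    """Simulate the sender side of path MTU discovery.

    link_mtus lists the MTUs of the links along the path in forwarding order.
    Returns the final estimate and the number of packets that were sent.
    """
    estimate = first_guess
    probes = 0
    while True:
        probes += 1
        # The packet (DF set) is dropped at the first link it does not fit
        # into; the router there reports its link MTU in an ICMP message.
        reported = next((mtu for mtu in link_mtus if mtu < estimate), None)
        if reported is None:
            return estimate, probes          # packet reached the destination
        estimate = reported                  # adapt the estimate and retry


path = [1500, 1492, 1280, 1500]
result = discover_path_mtu(path, 1500)
print(result)
```

In this example the estimate converges to the minimum link MTU of the path, which matches the definition of the path MTU given in the terminology section.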
3.2.6
IPv4 over IEEE 802.3
IPv4 packets are sent in the payload of IEEE 802.3 frames according to the specification in RFC
894 [30].
• IPv4 packets are identified by the value 0x800 in the IEEE 802.3 type field.
• According to the maximum length of IEEE 802.3 frames, the maximum link MTU is 1500
bytes.
• The mapping of IPv4 addresses to IEEE 802.3 addresses is table driven. Entries in so called
mapping tables (sometimes also called address translation tables) can either be statically configured or dynamically learned.
3.2.7
IPv4 Address Translation (ARP, RARP)
The Address Resolution Protocol (ARP) defined in RFC 826 [31] allows an IP node to determine
the link-layer address of a neighboring node on a broadcast network. The fundamental principle
is to broadcast a message to all stations attached to the broadcast network which asks for the
translation of an IP address to a link-layer address. Since the message is broadcast, it also reaches the node which
has the IP address assigned to one of its interfaces. This node can thus respond by sending a unicast
message back to the node which asked the question.
Subsequently, an extension was defined which supports reverse address resolution. The
Reverse Address Resolution Protocol (RARP) defined in RFC 903 [32] resolves a node’s hardware address
to an IP address.
In case of IPv4 addresses and IEEE 802.3 addresses, the following message format is used for both
ARP and RARP:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Hardware Type         |         Protocol Type         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     HLEN      |     PLEN      |           Operation           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Sender Hardware Address (SHA)                 =
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
= Sender Hardware Address (SHA) |    Sender IP Address (SIP)    =
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
=    Sender IP Address (SIP)    | Target Hardware Address (THA) =
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
=                 Target Hardware Address (THA)                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Target IP Address                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The ARP message format is not aligned to 32-bit word boundaries in case of IPv4 addresses
and IEEE 802 MAC addresses.
• The Hardware Type field identifies the address type used on the link-layer (the value 1 is
used for IEEE 802.3 MAC addresses).
• The Protocol Type field identifies the network layer address type (the value 0x800 is
used for IPv4).
• ARP/RARP packets use the type value 0x806 in the IEEE 802.3 frame.
• The Operation field contains the message type: ARP Request (1), ARP Response (2),
RARP Request (3), RARP Response (4).
• The sender fills, depending on the request type, either the Target Hardware Address
(RARP) field or the Target IP Address (ARP) field.
• The responding node swaps the Sender/Target fields and fills the empty fields with the requested information.
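The message layout and the swapping of the sender and target fields can be sketched as follows. The function names and the MAC and IP addresses are example values chosen for this illustration; the sketch only packs and unpacks the 28-byte message and does not send anything.

```python
import struct

# Hardware Type, Protocol Type, HLEN, PLEN, Operation, SHA, SIP, THA, TIP
ARP_FORMAT = "!HHBBH6s4s6s4s"


def build_arp_request(sender_mac, sender_ip, target_ip):
    """ARP Request (operation 1); the Target Hardware Address is left empty."""
    return struct.pack(ARP_FORMAT, 1, 0x0800, 6, 4, 1,
                       sender_mac, sender_ip, bytes(6), target_ip)


def build_arp_response(request, own_mac):
    """ARP Response (operation 2): swap sender/target, fill in the own MAC."""
    htype, ptype, hlen, plen, _op, sha, sip, _tha, tip = struct.unpack(ARP_FORMAT, request)
    return struct.pack(ARP_FORMAT, htype, ptype, hlen, plen, 2,
                       own_mac, tip, sha, sip)


asker_mac = bytes.fromhex("020000000001")
owner_mac = bytes.fromhex("020000000002")
request = build_arp_request(asker_mac, bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]))
response = build_arp_response(request, owner_mac)
print(len(request), response.hex())
```

The total size of 28 bytes corresponds to the seven 32-bit rows of the diagram above, even though the address fields themselves are not aligned to word boundaries.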
3.2.8
Automatic Configuration (DHCP)
The Dynamic Host Configuration Protocol (DHCP) defined in RFC 2131 [33] allows nodes (DHCP
clients) to retrieve configuration parameters dynamically from a central configuration server (DHCP
server). A binding is a collection of configuration parameters, including at least an IP address,
associated with or bound to a DHCP client. Bindings are managed by DHCP servers.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     op (1)    |   htype (1)   |   hlen (1)    |   hops (1)    |
+---------------+---------------+---------------+---------------+
|                    xid (4) - transaction id                   |
+-------------------------------+-------------------------------+
|           secs (2)            |           flags (2)           |
+-------------------------------+-------------------------------+
|               ciaddr (4) - client IPv4 address                |
+---------------------------------------------------------------+
|            yiaddr (4) - your (client) IPv4 address            |
+---------------------------------------------------------------+
|             siaddr (4) - next server IPv4 address             |
+---------------------------------------------------------------+
|             giaddr (4) - relay agent IPv4 address             |
+---------------------------------------------------------------+
|                                                               |
|           chaddr (16) - client hardware address of            |
|                  type htype and length hlen                   |
|                                                               |
+---------------------------------------------------------------+
|                                                               |
|          sname (64) - server name (null terminated)           |
|                                                               |
+---------------------------------------------------------------+
|                                                               |
|         file (128) - boot file name (null terminated)         |
|                                                               |
+---------------------------------------------------------------+
|                                                               |
|                      options (variable)                       |
+---------------------------------------------------------------+
The DHCP protocol supports the following message types:
• The DHCPDISCOVER message is a broadcast message which is sent by DHCP clients to
locate DHCP servers.
• The DHCPOFFER message is sent from a DHCP server to offer a client a set of configuration
parameters.
• The DHCPREQUEST message is sent from the client to a DHCP server as a response to a previous
DHCPOFFER message, to verify a previously allocated binding or to extend the lease of a
binding.
• The DHCPACK message is sent by a DHCP server with some additional parameters to the
client as a positive acknowledgement to a DHCPREQUEST.
• The DHCPNAK message is sent by a DHCP server to indicate that the client’s notion of a
configuration binding is incorrect.
• The DHCPDECLINE message is sent by a DHCP client to indicate that the offered network address is already
in use.
• The DHCPRELEASE message is sent by a DHCP client to inform the DHCP server that
configuration parameters are no longer used.
• The DHCPINFORM message is sent by a DHCP client to ask the DHCP server for local
configuration parameters only; the client already has an externally configured network address.
A typical exchange between a client and two candidate servers is displayed below.
      Server          Client          Server
  (not selected)                    (selected)

        v               v               v
        |               |               |
        |     Begins initialization     |
        |               |               |
        | _____________/|\____________  |
        |/DHCPDISCOVER  | DHCPDISCOVER \|
        |               |               |
    Determines          |          Determines
   configuration        |         configuration
        |               |               |
        |\              |  ____________/|
        | \________     | /DHCPOFFER    |
        | DHCPOFFER\    |/              |
        |           \   |               |
        |       Collects replies        |
        |             \ |               |
        |     Selects configuration     |
        |               |               |
        | _____________/|\____________  |
        |/ DHCPREQUEST  |  DHCPREQUEST\ |
        |               |               |
        |               |     Commits configuration
        |               |               |
        |               | _____________/|
        |               |/ DHCPACK      |
        |               |               |
        |    Initialization complete    |
        |               |               |
        .               .               .
        .               .               .
        |               |               |
        |      Graceful shutdown        |
        |               |               |
        |               |\ ____________ |
        |               | DHCPRELEASE  \|
        |               |               |
        |               |        Discards lease
        |               |               |
        v               v               v
See RFC 2131 [33] for a complete state diagram and a complete description of the possible transitions.
The options field of DHCP messages can contain various configuration options. Options may have a fixed
length or a variable length. All options begin with a tag octet, usually followed by a length field and the actual
value (tag-length-value, TLV). An initial set of options such as options to configure a list of routers, a list of
name servers and so on is defined in RFC 2132 [34].
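The tag-length-value encoding can be illustrated with a minimal parser. The sketch assumes that the leading magic cookie of the options field has already been removed, and it uses the option tags 0 (pad), 255 (end), 53 (DHCP message type) and 3 (router) from RFC 2132; the function name and the sample bytes are hypothetical.

```python
def parse_options(data: bytes) -> dict:
    """Parse DHCP options into a dictionary mapping tag to raw value bytes."""
    options = {}
    i = 0
    while i < len(data):
        tag = data[i]
        if tag == 0:                         # pad option: a single octet
            i += 1
            continue
        if tag == 255:                       # end option: stop parsing
            break
        if i + 1 >= len(data):
            raise ValueError("truncated option")
        length = data[i + 1]
        value = data[i + 2:i + 2 + length]
        if len(value) != length:
            raise ValueError("truncated option value")
        options[tag] = value
        i += 2 + length
    return options


# Message type 1 (DHCPDISCOVER), one pad octet, one router, end option.
sample = bytes([53, 1, 1, 0, 3, 4, 192, 0, 2, 1, 255])
options = parse_options(sample)
print(options)
```

The pad and end options are the exceptions mentioned above: they consist of the tag octet only and carry neither a length nor a value.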
Some security aspects related to the lack of authentication within DHCP are discussed in RFC 3118 [35] and
a proposal is made to provide delayed authentication, which is still subject to denial of service attacks.
3.3
Internet Protocol Version 6 (IPv6)
In the early 1990s, it became clear that the IPv4 address space is not large enough to support the expected
growth of the Internet. Work started to define version 6 of the IP protocol (IPv6). The current core IPv6
specification was published in 1998 in RFC 2460 [17] as Draft Standard. Implementations of IPv6 are
available on almost all platforms.
The primary goals driving the development of IPv6 were:
• Increase of the address space from 32 bit to 128 bit.
• Simplification of protocol headers for the most common cases to reduce processing costs and bandwidth consumption in the normal cases.
• Improved support for protocol extensions and options.
• Capability to mark packets that belong to particular traffic flows for which the sender requests special
handling.
• Authentication and privacy capabilities to support authentication, data integrity and optional data confidentiality of IPv6 packets.
• Integrated automatic end-system configuration capabilities.
3.3.1
IPv6 Addresses
IPv6 addresses are 128-bit identifiers for interfaces and sets of interfaces. The details are defined in RFC
3513 [8]. There are three types of IPv6 addresses:
• A unicast address is an identifier for a single interface. A packet sent to a unicast address is delivered
to the interface identified by that address.
• An anycast address is an identifier for a set of interfaces. A packet sent to an anycast address is delivered
to one of the interfaces identified by that address.
• A multicast address is an identifier for a set of interfaces. A packet sent to a multicast address is
delivered to all interfaces identified by that address.
The type of an IPv6 address is identified by the high-order bits of the address:
Address type          Binary prefix        IPv6 notation
Unspecified           00...0 (128 bits)    ::/128
Loopback              00...1 (128 bits)    ::1/128
Multicast             11111111             FF00::/8
Link-local unicast    1111111010           FE80::/10
Site-local unicast    1111111011           FEC0::/10
Global unicast        (everything else)
Table 3.1: IPv6 address type identification
Anycast addresses are taken from the unicast address spaces (of any scope) and are not syntactically distinguishable from unicast addresses.
Interface Identifiers
Interface identifiers in IPv6 unicast addresses are used to uniquely identify interfaces on a link. For all
unicast addresses, except those that start with binary 000, interface identifiers are required to be 64 bits long
and to be constructed in modified EUI-64 format.
The modified EUI-64 format can be obtained from IEEE 802 MAC addresses by inserting two octets with the
hexadecimal values 0xFF and 0xFE in the middle of the 48-bit MAC address. A 48-bit IEEE MAC address
with global scope has the following format:
|0              1|1              3|3              4|
|0              5|6              1|2              7|
+----------------+----------------+----------------+
|cccccc0gcccccccc|ccccccccmmmmmmmm|mmmmmmmmmmmmmmmm|
+----------------+----------------+----------------+
The c bits are the assigned company identification, 0 is the universal/local bit to indicate global scope, the g
bit is the individual/group bit, and m bits are the manufacturer selected extension identifier. The corresponding
modified EUI-64 identifier has the following format:
|0              1|1              3|3              4|4              6|
|0              5|6              1|2              7|8              3|
+----------------+----------------+----------------+----------------+
|cccccc1gcccccccc|cccccccc11111111|11111110mmmmmmmm|mmmmmmmmmmmmmmmm|
+----------------+----------------+----------------+----------------+
With this transformation, it is possible to compute a link local IPv6 address for each physical IEEE 802 MAC
interface.
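The transformation can be sketched in a few lines. The function names and the MAC address are example values; the sketch inserts the octets 0xFF and 0xFE in the middle of the MAC address, inverts the universal/local bit as indicated in the format above, and prepends the link-local prefix FE80::/10 followed by zero bits.

```python
import ipaddress


def modified_eui64(mac: str) -> bytes:
    """Derive the 64-bit interface identifier from a 48-bit MAC address."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    first = bytes([octets[0] ^ 0x02])        # invert the universal/local bit
    return first + octets[1:3] + b"\xff\xfe" + octets[3:]


def link_local_address(mac: str) -> ipaddress.IPv6Address:
    """Combine the link-local prefix, 54 zero bits and the interface ID."""
    prefix = bytes([0xFE, 0x80]) + bytes(6)
    return ipaddress.IPv6Address(prefix + modified_eui64(mac))


address = link_local_address("00:11:22:33:44:55")
print(address)
```

For the example MAC address 00:11:22:33:44:55 the resulting interface identifier is 0211:22ff:fe33:4455.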
While the automatic computation of interface identifiers from MAC addresses is a simple way to construct
link local and global IPv6 addresses, some people have concerns that these IPv6 addresses can be used to
track mobile nodes used in different networks. There are two approaches to address this concern: The first
approach is to use DHCP instead of IPv6 auto-configuration to assign IPv6 addresses. The other approach
documented in RFC 3041 [36] is to generate a pseudo-random sequence of interface identifiers via a one-way
hash function which depends on a random component and the globally unique interface identifier (where
available). The pseudo-random interface identifiers are then only used for a certain period of time.
Global Unicast Addresses
The general format for IPv6 global unicast addresses is as follows:
|         n bits         |   m bits  |       128-n-m bits         |
+------------------------+-----------+----------------------------+
| global routing prefix  | subnet ID |       interface ID         |
+------------------------+-----------+----------------------------+
The global routing prefix is a typically hierarchically structured value assigned to a site (a cluster of subnets/links), the subnet ID is an identifier of a link within the site.
IPv6 Addresses with Embedded IPv4 Addresses
There is a special IPv6 address space which contains the complete IPv4 address space. The so called
mapped IPv4 addresses were invented to make the transition from IPv4 to IPv6 networks easier. There is
an ongoing controversy whether this is actually the case.
|                80 bits               | 16 |      32 bits        |
+--------------------------------------+--------------------------+
|0000..............................0000|0000|    IPv4 address     |
+--------------------------------------+----+---------------------+
Link-Local Unicast Addresses
Link-local unicast addresses are assigned automatically and guaranteed to be unique on the link attached to
an interface.
|   10     |
|  bits    |         54 bits         |          64 bits           |
+----------+-------------------------+----------------------------+
|1111111010|           0             |       interface ID         |
+----------+-------------------------+----------------------------+
3.3.2 IPv6 Packet Format
IPv6 packets have the following structure, as specified in RFC 2460 [17]:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| Traffic Class |           Flow Label                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Payload Length        |  Next Header  |   Hop Limit   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                         Source Address                        +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                      Destination Address                      +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Version field contains the version number 6.
• The Traffic Class field contains in the current interpretation the Differentiated Services Code
Point (DSCP) as well as two bits for explicit congestion notification [21, 22, 23].
   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+-----+-----+-----+-----+
|          DS FIELD, DSCP           | ECN FIELD |
+-----+-----+-----+-----+-----+-----+-----+-----+
• The Flow Label field allows a source to mark packets sent to a destination address which belong
to a certain traffic flow (e.g., all packets that belong to a certain voice call). The
motivation for this field is that routers can handle packets that belong to a certain flow in a specific
way.
• The Payload Length field contains the length of the payload following the IPv6 protocol header.
Note that this field is different from the IPv4 Total Length field, which also counts the IPv4 header itself.
• The Next Header field identifies the type of the payload following the header. This is roughly
equivalent to the IPv4 Protocol field. Note, however, that IPv6 uses a daisy-chain of IPv6 headers
to realize IPv6 options as discussed below.
• The Hop Limit field is used to limit the lifetime of IPv6 packets. Every router which forwards an
IPv6 packet decrements this field and the packet is discarded if the value reaches zero.
• The Source Address and Destination Address fields contain the 128-bit source and destination addresses.
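The field layout described above can be decoded with a few shifts and masks. The following Python sketch is only an illustration under the assumption that the input starts with the fixed 40-byte IPv6 header; the function name parse_ipv6_header and the dictionary keys are hypothetical.

```python
import ipaddress
import struct


def parse_ipv6_header(packet: bytes) -> dict:
    """Decode the fixed 40-byte IPv6 header into its fields."""
    if len(packet) < 40:
        raise ValueError("an IPv6 header needs at least 40 bytes")
    first_word, payload_length, next_header, hop_limit = struct.unpack(
        "!IHBB", packet[:8]
    )
    traffic_class = (first_word >> 20) & 0xFF
    return {
        "version": first_word >> 28,          # upper 4 bits
        "traffic_class": traffic_class,       # next 8 bits
        "dscp": traffic_class >> 2,           # upper 6 bits of the traffic class
        "ecn": traffic_class & 0x3,           # lower 2 bits of the traffic class
        "flow_label": first_word & 0xFFFFF,   # lower 20 bits
        "payload_length": payload_length,
        "next_header": next_header,
        "hop_limit": hop_limit,
        "source": ipaddress.IPv6Address(packet[8:24]),
        "destination": ipaddress.IPv6Address(packet[24:40]),
    }
```

The first 32-bit word is unpacked as one integer because the Version, Traffic Class and Flow Label fields do not fall on byte boundaries.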
3.3.3 IPv6 Extensions
Compared to the IPv4 packet format, the IPv6 packet format is much simpler. This has been achieved by
moving some functionality into so called extension headers which can be carried in a daisy chain between
the IPv6 protocol header and the actual payload.
If a node does not understand an extension header, it has to discard the whole packet. Parameters, which can
be ignored by implementations, are called options and they are carried in special extension headers.
Routing Extension Header
The Routing Header (RH) is an extension header that can be used by the sender to specify one or more nodes
that must be visited on the way to the destination.
The RH extension header as defined in RFC 2460 [17] has the following format:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Next Header  |  Hdr Ext Len  |  Routing Type | Segments Left |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
.                                                               .
.                       Type-Specific Data                      .
.                                                               .
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Next Header field identifies the type of the payload following the RH extension header.
• The Hdr Ext Len field contains the length of the RH counted in 64-bit words minus 1.
• The Routing Type field identifies a certain variant of the RH and the semantics of the field
Type-Specific Data. At the time of this writing, only a single routing type has been defined.
• The Segments Left field indicates the number of remaining routing segments.
• The contents of the Type-Specific Data field depend on the value of the Routing Type
field. Under the currently defined Routing Type, this field contains 32 unused bits followed by
a sequence of 128-bit fields where each 128-bit field contains an IPv6 address. When an IPv6 packet
reaches the node identified by its destination address and there are remaining segments, the next routing
address is copied into the destination address field, the number of remaining segments is decremented and
the packet is forwarded to the new destination address.
Fragment Extension Header
IPv6 assumes that every link has a link MTU of at least 1280 bytes [17]. Links that only support smaller
MTUs must provide fragmentation and reassembly services below the IPv6 layer. Simple IPv6 implementations
which do not perform path MTU discovery must restrict themselves to packets which do not exceed 1280
bytes. Packets which are bigger than the path MTU can be fragmented by using the Fragment Header (FH)
extension. Only IPv6 source nodes are allowed to fragment IPv6 packets. In contrast to IPv4, routers are not
allowed to fragment packets.
The FH extension header as defined in RFC 2460 [17] has the following format:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Next Header  |   Reserved    |      Fragment Offset    |Res|M|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Identification                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Next Header field identifies the type of the payload following the FH extension header.
• The Fragment Offset field defines the relative position of the fragment (counted in 64-bit words)
in the original IPv6 packet.
• The flag M is set if more fragments follow. The bits Res are currently unused and reserved.
• The Identification field contains the same value for all fragments of an IPv6 packet.
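A short Python sketch may help to see how the Fragment Offset and the M flag share one 16-bit word. It assumes the 8-byte layout shown above; the function name parse_fragment_header is illustrative only.

```python
import struct


def parse_fragment_header(data: bytes) -> dict:
    """Decode the 8-byte Fragment Header."""
    if len(data) < 8:
        raise ValueError("a Fragment Header needs 8 bytes")
    next_header, _reserved, offset_flags, identification = struct.unpack(
        "!BBHI", data[:8]
    )
    return {
        "next_header": next_header,
        # Upper 13 bits: offset in 64-bit words, converted to bytes here.
        "fragment_offset_bytes": (offset_flags >> 3) * 8,
        # Lowest bit: M flag (more fragments follow).
        "more_fragments": bool(offset_flags & 0x1),
        "identification": identification,
    }
```

Because the offset is counted in 64-bit words, every fragment except the last one has to carry a multiple of 8 bytes.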
Authentication Extension Header
The Authentication Header (AH) extension header is used to provide data origin authentication, data integrity
and replay protection services for IPv6 packets.
The AH extension header as defined in RFC 2402 [37] has the following format:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Next Header  |  Payload Len  |          RESERVED             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Security Parameters Index (SPI)               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                Authentication Data (variable)                 |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Next Header field identifies the type of the payload following the AH extension header.
• The Payload Len field contains the length of the AH extension header counted in the number of
32-bit words minus 2.
• The Security Parameters Index field contains a value which together with the destination
address identifies a so called Security Association (SA). The SA is basically a data structure which
maintains all the necessary cryptographic information.
• The Sequence Number field contains a monotonically increasing sequence number. The first
packet which is sent after establishing an SA has the sequence number 1. If the sequence number
reaches 2^32, a new SA has to be established.
• The Authentication Data field contains an integrity check value, ICV. The length of this field
depends on the authentication function in use (which is determined by the SA).
Encapsulating Security Payload Extension Header
The Encapsulating Security Payload (ESP) extension header realizes security services such as confidentiality,
data origin authentication, data integrity, replay protection and limited traffic flow confidentiality.
The ESP as defined in RFC 2406 [38] has the following format:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ----
|               Security Parameters Index (SPI)                 | ^Auth.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |Cov-
|                      Sequence Number                          | |erage
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ----
|                    Payload Data* (variable)                   | |   ^
~                                                               ~ |   |
|                                                               | |Conf.
+               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |Cov-
|               |     Padding (0-255 bytes)                     | |erage*
+-+-+-+-+-+-+-+-+               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |   |
|                               |  Pad Length   | Next Header   | v   v
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ------
|                 Authentication Data (variable)                |
~                                                               ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Security Parameters Index field contains a value which together with the destination
address identifies a so called Security Association (SA). The SA is basically a data structure which
maintains all the necessary cryptographic information.
• The Sequence Number field contains a monotonically increasing sequence number. The first
packet which is sent after establishing an SA has the sequence number 1. If the sequence number
reaches 2^32, a new SA has to be established.
• The Payload Data field contains the encrypted payload (including any required initialization vectors).
• The Padding field can be used to align the payload to a certain desired length or to provide a certain
size required by the encryption function. The Padding field can also be used to hide the original size
of the actual payload.
• The Pad Length field contains the number of fill bytes.
• The Next Header field identifies the type of the payload.
• The Authentication Data field contains an integrity check value, ICV. The length of this field
depends on the authentication function in use (which is determined by the SA).
• Fragmentation can only happen after encryption. It is not allowed to apply ESP on a fragment.
Hop-by-Hop Options Extension Header
The Hop-by-Hop Options (HO) extension header carries optional information that must be examined by
every node along a packet’s delivery path.
The HO extension header as defined in RFC 2460 [17] has the following format:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Next Header  |  Hdr Ext Len  |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
|                                                               |
.                                                               .
.                            Options                            .
.                                                               .
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Next Header field identifies the type of the payload following the HO extension header.
• The Hdr Ext Len field contains the length of the HO counted in 64-bit words minus 1.
• The Options field contains the list of options. Each option is encoded as a tag-length-value (TLV)
triple:
 0                   1
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- - - - - - - - -
|  Option Type  |  Opt Data Len |  Option Data
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- - - - - - - - -
The Option Type field identifies the option kind and the Opt Data Len field contains the
length of the Option Data field counted in bytes. The sequence of options in an HO extension
header is processed in the order they appear in the header.
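The TLV encoding can be walked with a simple loop. The following Python sketch is a hedged illustration: the function name parse_options is hypothetical, and the special handling of option type 0 (the one-byte Pad1 option of RFC 2460, which has no length and no data) is an assumption taken from the RFC rather than from the text above.

```python
def parse_options(options: bytes) -> list:
    """Return (option type, option data) tuples in order of appearance."""
    result = []
    pos = 0
    while pos < len(options):
        option_type = options[pos]
        if option_type == 0:  # Pad1 (assumed from RFC 2460): a single byte
            pos += 1
            continue
        if pos + 2 > len(options):
            raise ValueError("truncated option header")
        data_len = options[pos + 1]  # Opt Data Len, counted in bytes
        end = pos + 2 + data_len
        if end > len(options):
            raise ValueError("option data exceeds the Options field")
        result.append((option_type, options[pos + 2:end]))
        pos = end
    return result


print(parse_options(bytes([0, 5, 2, 0, 0, 1, 2, 0, 0])))
```

The list preserves the order of the options since they are processed in the order in which they appear.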
Destination Options Extension Header
The Destination Options (DO) extension header carries optional information that must be processed by the
final receiver of the packet.
The DO extension header as defined in RFC 2460 [17] has the following format:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Next Header  |  Hdr Ext Len  |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
|                                                               |
.                                                               .
.                            Options                            .
.                                                               .
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The fields Next Header, Hdr Ext Len and Options have the same format and semantics as
in the HO extension header.
3.3.4 IPv6 Forwarding
IPv6 packets are forwarded using the same longest prefix match algorithm that is used in IPv4 networks.
However, IPv6 addresses have much longer prefixes, which allows better address aggregation in order
to reduce the number of forwarding table entries. On the other hand, due to the length of the prefixes, it is
even more crucial to use an algorithm whose complexity does not depend on the number of entries in the
forwarding table or the average prefix length.
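To make the notion of a longest prefix match concrete, here is a deliberately naive Python sketch. It scans the whole table, so its cost grows with the number of entries, which is exactly what production forwarding engines avoid (for example with tries or hardware lookups). The function name longest_prefix_match and the table layout (prefix string mapped to a next-hop label) are assumptions for illustration.

```python
import ipaddress


def longest_prefix_match(table: dict, destination: str):
    """Return the next hop of the most specific prefix covering destination."""
    dest = ipaddress.ip_address(destination)
    best_net, best_hop = None, None
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        if net.version != dest.version or dest not in net:
            continue
        # Prefer the entry with the longest (most specific) prefix.
        if best_net is None or net.prefixlen > best_net.prefixlen:
            best_net, best_hop = net, next_hop
    return best_hop


table = {"::/0": "default", "2001:db8::/32": "A", "2001:db8:1::/48": "B"}
print(longest_prefix_match(table, "2001:db8:1::5"))
```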
3.3.5 IPv6 Error Handling (ICMPv6)
The Internet Control Message Protocol Version 6 (ICMPv6) is an adapted version of the ICMPv4 protocol.
It introduces a set of control messages which are needed to report errors, to run diagnostic tests, to autoconfigure IPv6 nodes and to resolve IPv6 addresses to link-layer addresses.
The ICMPv6 messages defined in RFC 2463 [39] have the following format:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                         Message Body                          +
|                                                               |
• The Type field identifies the type of an ICMPv6 message. ICMPv6 messages are categorized into
error messages (Type 0-127) and informational messages (Type 128-255).
Type  Description              Reference
----  -----------------------  ---------
   1  Destination Unreachable  RFC 2463
   2  Packet Too Big           RFC 2463
   3  Time Exceeded            RFC 2463
   4  Parameter Problem        RFC 2463
 128  Echo Request             RFC 2463
 129  Echo Reply               RFC 2463
 133  Router Solicitation      RFC 2461
 134  Router Advertisement     RFC 2461
 135  Neighbor Solicitation    RFC 2461
 136  Neighbor Advertisement   RFC 2461
 137  Redirect                 RFC 2461
Table 3.2: ICMPv6 message types
• The Code field contains a value which further discriminates the message type. The exact meaning of
the Code value depends on the contents of the Type field.
• The Checksum field contains the Internet checksum computed over the ICMPv6 message and parts
of the IPv6 protocol header.
• The contents of the Message Body depends on the ICMPv6 message type.
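The checksum computation can be sketched in Python as follows. The sketch assumes that the "parts of the IPv6 protocol header" are the pseudo-header of RFC 2460 (source address, destination address, upper-layer length and the next header value 58 for ICMPv6); the function names are illustrative and the message is expected to carry a zero Checksum field while the checksum is computed.

```python
import ipaddress
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of all 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold the carries back into the lower 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def icmpv6_checksum(src: str, dst: str, message: bytes) -> int:
    """Checksum over the assumed pseudo-header followed by the ICMPv6 message."""
    pseudo_header = (
        ipaddress.IPv6Address(src).packed
        + ipaddress.IPv6Address(dst).packed
        + struct.pack("!I3xB", len(message), 58)  # length, 3 zero bytes, next header
    )
    return internet_checksum(pseudo_header + message)
```

A receiver can verify a message by running the same computation over the received message including its Checksum field; the result is then 0.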
3.3.6 IPv6 over IEEE 802.3
IPv6 packets can be encapsulated into IEEE 802.3 frames and sent over IEEE 802.3 networks as defined in
RFC 2464 [40]:
• Frames containing IPv6 packets are identified by the value 0x86dd in the IEEE 802.3 type field.
• The link MTU is 1500 bytes which corresponds to the maximum IEEE 802.3 frame payload size of 1500 bytes.
• The mapping of IPv6 addresses to IEEE 802.3 addresses is table driven. Entries in so called mapping tables (sometimes also called address translation tables) can either be statically configured or
dynamically learned using neighbor discovery.
3.3.7 IPv6 Neighbor Discovery
IPv6 supports the automatic configuration of hosts (autoconfiguration) and the discovery of neighbors attached to the same link. Neighbor Discovery (ND), which is part of the ICMPv6 protocol, simplifies the
configuration of hosts and includes features that are realized by different protocols (ICMPv4, ARP) in IPv4.
ND as documented in RFC 2461 [41] supports the following features:
• Discovery of the local routers that are attached to the same link (router discovery).
• Discovery of the prefixes used on a link so that it is possible to determine which IPv6 addresses
can be reached directly (prefix discovery).
• Discovery of parameters such as the link MTU or the hop limit for outgoing packets (parameter discovery).
• Automatic configuration of IPv6 addresses (address autoconfiguration).
• Resolution of IPv6 addresses to link-layer addresses (address resolution).
• Determination of next-hop addresses for IPv6 destination addresses (next-hop determination).
• Detection of unreachable nodes which are attached to the same link (neighbor unreachability detection).
• Detection of conflicts that can arise during address generation (duplicate address detection).
• Discovery of better alternatives to forward packets (redirect).
The ND protocol uses some special IPv6 addresses:
• all-nodes: The link-local multicast address FF02::1 is used to reach all nodes connected to a link.
• all-routers: The link-local multicast address FF02::2 is used to reach all routers connected to a link.
• solicited-node: A link-local multicast address which is derived from a unicast address of a node
by taking the low-order 24 bits of that address and appending those bits to the prefix
FF02:0:0:0:0:1:FF00::/104.
• link-local: A link-local unicast address which in the case of IEEE 802 links can be derived from the
IEEE 802 MAC address as discussed above.
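The derivation of the solicited-node address from a unicast address can be written down in a few lines of Python. This is a minimal sketch; the function name solicited_node_address is hypothetical.

```python
import ipaddress


def solicited_node_address(unicast: str) -> ipaddress.IPv6Address:
    """Append the low-order 24 bits of unicast to FF02:0:0:0:0:1:FF00::/104."""
    low_24_bits = int(ipaddress.IPv6Address(unicast)) & 0xFFFFFF
    prefix = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(prefix | low_24_bits)


print(solicited_node_address("fe80::211:22ff:fe33:4455"))
```

Since only 24 bits of the address enter the multicast group, several addresses may share one group, but in practice only very few nodes on a link have to process a given solicitation.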
The ND protocol is realized as an extension of the ICMPv6 protocol and introduces five new message formats.
To prevent some attacks on the ND protocol, it is required that the Hop Limit field of the IPv6 protocol
header is set to the value 255. Receivers of ND protocol messages must discard messages where the Hop
Limit field does not contain the value 255. A received packet can only contain a value other than 255 if it
has been forwarded by a router, which indicates a potential attack from somewhere outside the link.
Router Solicitation
Hosts can ask routers attached to a link to generate router advertisements by sending a Router Solicitation
(RS) message to the all-routers link-local multicast group. The format of the RS message is as follows:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Reserved                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Options ...
+-+-+-+-+-+-+-+-+-+-+-+-
• The Type field contains the value 133 and the Code field contains the value 0.
• The Checksum field contains the usual ICMPv6 checksum.
• The Options field may contain the link-layer address of the sender, if known.
Router Advertisement
Routers send Router Advertisement (RA) messages periodically or as a reaction to an RS message. RA messages
are sent to the all-nodes multicast group. The format of the RA message is as follows:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Cur Hop Limit |M|O|  Reserved |       Router Lifetime         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Reachable Time                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Retrans Timer                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Options ...
+-+-+-+-+-+-+-+-+-+-+-+-
• The Type field contains the value 134 and the Code field contains the value 0.
• The Checksum field contains the usual ICMPv6 checksum.
• The Cur Hop Limit field contains a proposed value which should be used by hosts in the Hop
Limit field of outgoing IPv6 packets.
• The flag M indicates that hosts should use in addition another mechanism such as DHCPv6 for the
autoconfiguration of addresses (managed address configuration).
• The flag O indicates that hosts should use in addition another mechanism such as DHCPv6 for the
autoconfiguration of other parameters (other stateful configuration).
• The Router Lifetime field defines the time (in seconds) in which the advertised router may be
used as a default router.
• The Reachable Time field defines the time (in milliseconds) in which a node assumes a neighbor
is reachable after having received a reachability confirmation.
• The Retrans Timer field defines the time (in milliseconds) between retransmitted Neighbor Solicitation messages.
• The Options field may contain additional parameters such as the link-layer address of the sending
router, the link MTU or information about the prefixes that are used on the link.
Neighbor Solicitation
Hosts can ask other nodes attached to a link to generate neighbor advertisements by sending a Neighbor
Solicitation (NS) message, normally to the solicited-node multicast group of the target address. The format
of the NS message is as follows:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Reserved                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                         Target Address                        +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Options ...
+-+-+-+-+-+-+-+-+-+-+-+-
• The Type field contains the value 135 and the Code field contains the value 0.
• The Checksum field contains the usual ICMPv6 checksum.
• The Target Address field contains the address for which information is requested.
• The Options field may contain the link-layer address of the sender, if known.
Neighbor Advertisement
Hosts send a Neighbor Advertisement (NA) message as a reaction to a Neighbor Solicitation. Unsolicited
NA messages can also be sent in order to propagate changes quickly. Solicited NA messages are sent to the
IPv6 address of the requestor while unsolicited NA messages are sent to the all-nodes multicast group.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R|S|O|                     Reserved                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                         Target Address                        +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Options ...
+-+-+-+-+-+-+-+-+-+-+-+-
• The Type field contains the value 136 and the Code field contains the value 0.
• The Checksum field contains the usual ICMPv6 checksum.
• The flag R indicates that the sender is a router. The flag S indicates that the message is sent as
a reaction to a Neighbor Solicitation message. The flag O indicates that the contained information
should overwrite any existing cache entries.
• The Options field may contain the link-layer address of the sender, if known.
Redirect
Routers can generate Redirect (R) messages to inform hosts about better paths towards a given destination
address.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Reserved                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                         Target Address                        +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                      Destination Address                      +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Options ...
+-+-+-+-+-+-+-+-+-+-+-+-
• The Type field contains the value 137 and the Code field contains the value 0.
• The Checksum field contains the usual ICMPv6 checksum.
• The Target Address field contains the IPv6 address of a router which provides a better path to
the destination address.
• The Destination Address field contains the destination address which is being redirected.
• The Options field may contain the link-layer address of the Target Address, if known. In addition, the Options field should contain the beginning of the IPv6 packet which caused the generation
of the Redirect message.
3.4 Routing Protocols
The forwarding of packets in the Internet is controlled by forwarding tables which are present on all nodes.
Every node has a more or less limited view about the overall network topology and relies on the help of
other routers to move packets closer to their destination. In order to establish and maintain connectivity, it is
necessary to update and synchronize these forwarding tables. While this can be done manually on small leaf
networks, it is necessary to automate this process in larger and core backbone networks. Routing protocols
have been developed for exactly this purpose.
• For routing purposes, the Internet is divided into autonomous systems (ASs). An autonomous system
(AS) is basically a set of routers and networks under the same administration.
• The routing protocol(s) within an AS are called Interior Gateway Protocols (IGPs). They are generally independent from the routing protocols used in other ASs. Widely used IGPs are the Routing
Information Protocol (RIP) and the Open Shortest Path First (OSPF) protocol.
• The routing protocol(s) between ASs are called Exterior Gateway Protocols (EGPs). The currently
most widely used EGP is the Border Gateway Protocol version 4 (BGP4).
3.4.1 Routing Information Protocol (RIP)
The Routing Information Protocol version 2 (RIP-2) defined in RFC 2453 [42] is a simple routing protocol
to be used within ASs. It is based on the exchange of distance vectors and thus falls into the class of distance vector routing protocols. The foundation of this protocol is the Bellman-Ford algorithm for computing
shortest paths in graphs.
Bellman-Ford Shortest Paths Algorithm
• Let G = (V, E) be a graph with the vertices V and the edges E with n = |V | and m = |E|.
• Let D be an n × n distance matrix in which D(i, j) denotes the distance from node i ∈ V to the node
j ∈ V.
• Let H be an n × n matrix in which H(i, j) ∈ E denotes the edge on which node i ∈ V forwards a
message to node j ∈ V.
• Let M be a vector with the link metrics, S a vector with the start nodes of the links and D a vector with
the end nodes of the links. The vector D[l] is indexed by links and must not be confused with the distance
matrix D(i, j).
1. Set D(i, j) = ∞ for i ≠ j and D(i, j) = 0 for i = j.
2. For all edges l ∈ E and for all nodes k ∈ V : Set i = S[l] and j = D[l] and d = M [l] + D(j, k).
3. If d < D(i, k), set D(i, k) = d and H(i, k) = l.
4. Repeat from step 2 if at least one D(i, k) has changed. Otherwise, stop.
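The four steps can be sketched in Python as follows. This is an illustrative rendering under a few assumptions: nodes are numbered 0 to n-1, links are directed and given as (start, end, metric) tuples (replacing the three vectors S, D and M), and all metrics are positive so that the iteration terminates. The names bellman_ford, dist and hop are not taken from the lecture notes.

```python
import math


def bellman_ford(n: int, links: list):
    """Return (dist, hop): all-pairs distances and the outgoing link index."""
    # Step 1: distance 0 to oneself, infinity to everybody else.
    dist = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    hop = [[None] * n for _ in range(n)]
    changed = True
    while changed:  # Step 4: repeat while some distance improved
        changed = False
        for l, (i, j, metric) in enumerate(links):  # Step 2: all links ...
            for k in range(n):                      # ... and all nodes k
                d = metric + dist[j][k]
                if d < dist[i][k]:                  # Step 3: shorter path via link l
                    dist[i][k] = d
                    hop[i][k] = l
                    changed = True
    return dist, hop


links = [(0, 1, 1), (1, 0, 1), (1, 2, 1), (2, 1, 1), (0, 2, 5), (2, 0, 5)]
dist, hop = bellman_ford(3, links)
print(dist[0][2], hop[0][2])
```

In a distance vector protocol, each node only maintains its own row of the matrix and learns the rows of its neighbors from their advertisements.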
Properties
• Simple distance vector protocols like RIP have the property that good news propagates quickly while
bad news propagates relatively slowly.
• In particular, the failure of links can lead to situations where the bad news propagates slowly by
counting up the costs (count to infinity).
• RIP defines infinity to be 16 hops. Hence, RIP can only be used in networks where the longest path
(the network diameter) is smaller than 16 hops.
• RIP uses the number of hops as the only metric.
Protocol
RIP-2 runs over the User Datagram Protocol (UDP) and normally uses port number 520. All RIP-2
messages have the following structure:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    Command    |    Version    |         must be zero          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                          RIP Entries                          ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Command field indicates whether the message is a request or a response. Response messages
can also be sent without a previous request (unsolicited responses).
• The Version field contains the protocol version number.
• The RIP Entries field contains a list of fixed-size so called RIP Entries.
A RIP Entry has the following structure:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Address Family Identifier   |           Route Tag           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           IP Address                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Subnet Mask                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Next Hop                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             Metric                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Address Family Identifier field identifies an address family. RIP was originally developed for networks with different address formats.
• The Route Tag field marks entries which contain external routes, which might have been established
by an EGP.
• The IP Address field contains an IPv4 destination address.
• The Subnet Mask field indicates the network prefix.
• The Next Hop field contains the IPv4 address of the next hop router to which packets to the destination specified by this route entry should be forwarded. Specifying a value of 0.0.0.0 indicates that
routing should be via the originator of the RIP advertisement.
• The Metric field contains a value between 1 and 15 inclusive. The value 16 is used when the
destination is not reachable (infinity).
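The 20-byte RIP Entry maps directly onto a fixed struct layout. The following Python sketch is only an illustration: the function names are hypothetical, and the Address Family Identifier value 2 for IPv4 is taken from RFC 2453 rather than from the text above.

```python
import ipaddress
import struct

RIP_ENTRY = struct.Struct("!HH4s4s4sI")  # AFI, route tag, address, mask, next hop, metric
AFI_IPV4 = 2          # assumed from RFC 2453
RIP_INFINITY = 16     # metric value meaning "not reachable"


def pack_rip_entry(prefix: str, next_hop: str = "0.0.0.0",
                   metric: int = 1, route_tag: int = 0) -> bytes:
    """Encode one RIP-2 route entry for an IPv4 prefix such as 192.0.2.0/24."""
    if not 1 <= metric <= RIP_INFINITY:
        raise ValueError("metric must be between 1 and 16")
    network = ipaddress.IPv4Network(prefix)
    return RIP_ENTRY.pack(AFI_IPV4, route_tag,
                          network.network_address.packed,
                          network.netmask.packed,
                          ipaddress.IPv4Address(next_hop).packed,
                          metric)


def unpack_rip_entry(data: bytes) -> dict:
    """Decode one 20-byte RIP-2 route entry."""
    afi, tag, address, mask, next_hop, metric = RIP_ENTRY.unpack(data)
    return {
        "afi": afi,
        "route_tag": tag,
        "address": str(ipaddress.IPv4Address(address)),
        "mask": str(ipaddress.IPv4Address(mask)),
        "next_hop": str(ipaddress.IPv4Address(next_hop)),
        "metric": metric,
        "reachable": metric < RIP_INFINITY,
    }
```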
The first RIP Entry can have a special format to support authentication:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             0xFFFF            |      Authentication Type      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                         Authentication                        +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The constant 0xFFFF is used to distinguish an authentication entry from other entries.
• The Authentication Type field identifies an authentication scheme. The RIP-2 specification
only defines a simple cleartext password authentication scheme.
• The Authentication field contains data which is checked by the receiver to determine the authenticity of the message.
• RFC 2082 [43] defines an authentication scheme based on MD5 which uses an additional trailer at the
end of a RIP-2 message.
3.4.2 Open Shortest Path First (OSPF)
The Open Shortest Path First (OSPF) protocol [44] is a routing protocol for use within autonomous systems.
It is based on the idea that all nodes have access to the actual state of the links and the resulting topology (link
state routing). Every node independently computes the shortest paths to all the other nodes by using Dijkstra’s
shortest path algorithm. The link state information is distributed by flooding.
Dijkstra’s Shortest Paths Algorithm
1. All nodes are initially labeled with infinite costs indicating that the costs to reach the node are not yet
known. The cost label is marked as tentative and will be updated as the algorithm proceeds.
2. The costs of the root node are set to 0 and the root node is marked as the current node.
3. The cost label attached to the current node is marked permanent.
4. All directly adjacent nodes are now considered in turn. For each adjacent node, the costs for reaching
the node are calculated by taking the costs of the current node and adding the link costs for the link
that connects the current node to the adjacent node. If the resulting sum is smaller than the cost label
of the adjacent node, the cost label is updated with the new cost and the name of the current node.
5. If there are still nodes with tentative cost labels, a node with the smallest costs is selected as the new
current node. Go to step 3 if a new current node was selected.
6. The shortest paths to a destination node can now be read by following the labels from the destination
node towards the root.
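The labeling procedure can be sketched in Python. The sketch makes two assumptions that are not part of the text above: the graph is given as a dictionary mapping every node to a dictionary of neighbors and non-negative link costs, and a priority queue (heapq) is used to pick the tentative node with the smallest costs. The names dijkstra and path_to are illustrative.

```python
import heapq
import math


def dijkstra(graph: dict, root):
    """Return (cost, previous) labels for all nodes as seen from root."""
    cost = {node: math.inf for node in graph}   # step 1: tentative, infinite costs
    previous = {node: None for node in graph}
    cost[root] = 0                              # step 2: the root costs nothing
    permanent = set()
    queue = [(0, root)]
    while queue:
        current_cost, current = heapq.heappop(queue)  # step 5: cheapest tentative node
        if current in permanent:
            continue                            # stale queue entry, already settled
        permanent.add(current)                  # step 3: label becomes permanent
        for neighbor, link_cost in graph[current].items():  # step 4
            new_cost = current_cost + link_cost
            if new_cost < cost[neighbor]:
                cost[neighbor] = new_cost
                previous[neighbor] = current
                heapq.heappush(queue, (new_cost, neighbor))
    return cost, previous


def path_to(previous: dict, destination) -> list:
    """Step 6: follow the labels from the destination back to the root."""
    path = []
    node = destination
    while node is not None:
        path.append(node)
        node = previous[node]
    return path[::-1]
```

A node may be pushed onto the queue several times with decreasing costs; the check against the set of permanent nodes discards the outdated entries.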
OSPF Areas
• An OSPF area is a group of networks within an autonomous system.
• The internal topology of an OSPF area is invisible to other OSPF areas. The routing within an area
(intra-area routing) is constrained to that area.
• The OSPF areas are inter-connected via the OSPF backbone area (OSPF area 0). A path from a source
node within one area to a destination node in another area has three segments (inter-area routing):
1. An intra-area path from the source to a so called area border router.
2. A path in the backbone area from the area border router of the source area to the area border router of
the destination area.
3. An intra-area path from the area border router of the destination area to the destination node.
• OSPF routers are classified according to their location in the OSPF topology:
1. Internal Router: A router where all interfaces belong to the same OSPF area.
2. Area Border Router: A router which connects multiple OSPF areas. An area border router has
to be able to run the basic OSPF algorithm for all areas it is connected to.
3. Backbone Router: A router that has an interface to the backbone area. Every area border router
is automatically a backbone router.
4. AS Boundary Router: A router that exchanges routing information with routers belonging to
other autonomous systems.
• Stub Areas are OSPF areas with a single area border router. The routing in stub areas can be simplified
by using default forwarding table entries which significantly reduces the overhead.
Protocol
OSPF messages are carried in IP packets. The value of the Protocol field of the IPv4 header or the Next
Header field of the IPv6 header is 89 for the OSPF protocol. All OSPF messages have the same header:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Version #   |     Type      |         Packet Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Router ID                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Area ID                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |             AuType            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Authentication                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Authentication                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Version field contains the OSPF version number, currently 2.
• The Type field identifies the message type.
• The Packet Length field contains the length of the whole OSPF message counted in bytes.
• The Router ID field identifies the router that originated an OSPF message.
• The Area ID identifies the OSPF area. The Area ID for the OSPF backbone is 0, often written in
the dotted quad notation as 0.0.0.0.
• The Checksum field contains the Internet checksum computed over the whole OSPF message without
the authentication field.
• The AuType field identifies the type of authentication procedure in use.
• The Authentication field contains authentication data. The format of the authentication data
depends on the authentication type.
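As an illustration of the field layout described above, the following Python sketch decodes the 24-byte common OSPF header with the standard struct module. The names OspfHeader, parse_ospf_header and dotted_quad are hypothetical helpers introduced for this example; the sketch does not verify the checksum or the authentication data.

```python
import socket
import struct
from typing import NamedTuple

class OspfHeader(NamedTuple):
    version: int
    type: int
    packet_length: int
    router_id: int
    area_id: int
    checksum: int
    autype: int
    authentication: bytes

# Network byte order: two 8-bit fields, one 16-bit field, two 32-bit fields,
# two 16-bit fields and 8 bytes of authentication data (24 bytes in total).
OSPF_HEADER = struct.Struct("!BBHIIHH8s")

def parse_ospf_header(data: bytes) -> OspfHeader:
    """Decode the common header found at the start of every OSPF message."""
    return OspfHeader(*OSPF_HEADER.unpack_from(data))

def dotted_quad(value: int) -> str:
    """Render a 32-bit Router ID or Area ID in dotted quad notation."""
    return socket.inet_ntoa(struct.pack("!I", value))

# Hypothetical message header: version 2, type 1 (Hello), backbone area.
raw = OSPF_HEADER.pack(2, 1, 44, 0x0A000001, 0, 0, 0, bytes(8))
header = parse_ospf_header(raw)
print(header.version, header.type, dotted_quad(header.router_id),
      dotted_quad(header.area_id))
```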
Hello
The Hello protocol is used to test the status of links and the attached neighbors. The hello protocol works
differently on broadcast networks, non-broadcast multi-access networks and point-to-multipoint networks.
On broadcast and non-broadcast multi-access networks, the hello protocol selects a Designated Router and a
Backup Designated Router.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Version #   |   Type = 1    |         Packet length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Router ID                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Area ID                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |             AuType            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Authentication                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Authentication                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Network Mask                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         HelloInterval         |    Options    |    Rtr Pri    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     RouterDeadInterval                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Designated Router                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Backup Designated Router                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Neighbor                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              ...                              |
• The first fields contain the standard OSPF message header, where the Type field has the value 1.
• The Network Mask field contains the network mask of the interface.
• The HelloInterval field contains the time interval in seconds between successive
Hello messages.
• The Rtr Pri field contains the priority of the router, which is used for the election of the Designated
Router and the Backup Designated Router. Routers with priority 0 do not take part in the election.
• The RouterDeadInterval field defines the time interval in seconds after which a router from which
no Hello messages have been received is considered unreachable.
• The Designated Router field contains the identity of the Designated Router, or 0 if no
Designated Router is known yet.
• The Backup Designated Router field contains the identity of the Backup Designated Router,
or 0 if no Backup Designated Router is known yet.
• The message ends with a list of Neighbor fields. Each Neighbor field identifies a router from
which a Hello message has been received during the last RouterDeadInterval.
• A link is considered available if Hello messages can be exchanged in both directions. On direct
connections (point-to-point links, virtual links), the exchange of the database can start as soon as the
link has been detected as available.
• On network connections (broadcast links, non-broadcast links), the Designated Router and the Backup
Designated Router are determined first:
1. Initially, a router stays passive for one RouterDeadInterval: it collects incoming Hello messages
and generates its own Hello messages in which it does not offer itself for election.
Afterwards, only those neighbors are considered for which the link is available in both
directions.
2. If one or more routers have offered themselves as Backup Designated Router, the router with the
highest priority is selected. If the priority is not unique, the router with the largest
identifier is selected from the candidates.
3. If no router has offered itself as Backup Designated Router, the router with the
highest priority (and the largest identifier) is selected.
4. If one or more routers have offered themselves as Designated Router, the router with the
highest priority is selected. If the priority is not unique, the router with the largest
identifier is selected from the candidates.
5. If no router has offered itself as Designated Router, the router with the highest
priority (and the largest identifier) is selected.
A router cannot be Designated Router and Backup Designated Router at the same time. Therefore, steps 2
and 3 have to be repeated after step 5.
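The comparison rule used in the election steps above (highest priority first, largest router identifier as tie-breaker, priority 0 never eligible) can be sketched as follows. This is only the selection rule, not the complete election procedure; the Router type and the function name select_router are assumptions made for this example.

```python
from typing import NamedTuple, Optional, Sequence

class Router(NamedTuple):
    router_id: int   # 32-bit Router ID
    priority: int    # value of the Rtr Pri field

def select_router(candidates: Sequence[Router]) -> Optional[Router]:
    """Pick the candidate with the highest priority; ties go to the largest ID."""
    eligible = [r for r in candidates if r.priority > 0]   # priority 0 is excluded
    if not eligible:
        return None
    return max(eligible, key=lambda r: (r.priority, r.router_id))

# Hypothetical neighbors on a broadcast link.
neighbors = [Router(1, 1), Router(2, 1), Router(3, 0), Router(4, 5)]
print(select_router(neighbors))       # highest priority wins
print(select_router(neighbors[:3]))   # equal priority: largest identifier wins
```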
Exchange
The Exchange protocol is responsible for the initial synchronization of the database.
...
Flooding
...
3.4.3 Border Gateway Protocol (BGP)
Autonomous systems usually perform policy-based routing by using the Border Gateway Protocol version
4 (BGP4) as defined in RFC 1771 [45] to exchange reachability information between autonomous systems
(ASs). The reachability information is sufficient to construct a graph of AS connectivity from which routing
loops may be pruned and some policy decisions at the autonomous system level may be enforced.
BGP4 runs over the reliable transport protocol TCP, which eliminates the need to implement explicit
fragmentation, retransmission, acknowledgement, and sequencing in BGP4 itself. BGP4 uses TCP port 179
for establishing connections between two BGP4 peers which are typically located in different ASs.
When two ASs agree to exchange routing information, each AS must designate a router that will speak BGP4
on its behalf. These two routers are called BGP4 peers. The peers establish a TCP connection and run the
BGP4 protocol which basically has three phases:
1. The BGP4 peers exchange messages to open and confirm connection parameters.
2. The BGP4 peers initially exchange the entire BGP routing table. Incremental updates are sent as the
routing tables change.
3. The BGP4 peers exchange so called keep-alive messages periodically to ensure that the connection
and the BGP4 peers are alive.
BGP4 Message Header
Each BGP4 message has a fixed-size header which may or may not be followed by a data portion:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                                                               +
|                           Marker                              |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Length               |      Type     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Marker field contains a value (initially all 1s) that the receiver of the BGP4 message can predict and verify. The Marker field can be used to detect loss of synchronization and to authenticate
incoming BGP messages.
• The Length field indicates the total length of the message including the header, counted in bytes.
The maximum length of BGP4 messages is 4096 bytes.
• The Type field indicates the type of the message. The following message types are defined:
1. OPEN
2. UPDATE
3. NOTIFICATION
4. KEEPALIVE
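The header layout can be illustrated with a short Python sketch that decodes the fixed 19-byte header (16-byte Marker, 2-byte Length, 1-byte Type). The function names are hypothetical, and the sketch only checks the length bounds and the message type; it does not validate the Marker.

```python
import struct

# 16-byte Marker, 16-bit Length, 8-bit Type, all in network byte order.
BGP_HEADER = struct.Struct("!16sHB")
MESSAGE_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}
MAX_MESSAGE_LENGTH = 4096

def parse_bgp_header(data: bytes):
    """Return (marker, length, type name) of the BGP4 message starting at data[0]."""
    marker, length, msg_type = BGP_HEADER.unpack_from(data)
    if not BGP_HEADER.size <= length <= MAX_MESSAGE_LENGTH:
        raise ValueError(f"bad message length {length}")
    if msg_type not in MESSAGE_TYPES:
        raise ValueError(f"bad message type {msg_type}")
    return marker, length, MESSAGE_TYPES[msg_type]

def keepalive() -> bytes:
    """A KEEPALIVE message is just the header with an all-ones Marker."""
    return BGP_HEADER.pack(b"\xff" * 16, BGP_HEADER.size, 4)

print(parse_bgp_header(keepalive())[1:])
```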
It is important to realize that BGP peers in general advertise only routes that should be seen from the outside.
There might be additional possible routes which for policy reasons are not announced to other peers. Furthermore, it is important to realize that BGP only advertises routing information. The final decision which
paths are selected by putting appropriate entries into the forwarding tables remains a local policy decision.
For an analysis of the usage of BGP, the growth of BGP routing tables and the increase of AS numbers,
see [46].
BGP4 Open Message
Once a TCP connection has been established between two BGP4 peers, they both send an OPEN message to
communicate their AS number and to establish other parameters.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+
|    Version    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Autonomous System Number    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Hold Time           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         BGP Identifier                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Opt Parm Len  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                       Optional Parameters                     |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Version field contains the protocol version number.
• The Autonomous System Number field contains the 16-bit AS number of the sender.
• The Hold Time field specifies the maximum time in seconds that the receiver should wait for the
next message from the sender before it considers the connection broken.
• The BGP Identifier field contains a 32-bit value which uniquely identifies the sender. This
identifier is by definition selected from the IPv4 addresses of the sender.
• The Opt Parm Len field contains the total length of the Optional Parameters field or zero
if no optional parameters are present.
• The Optional Parameters field contains a list of parameters. Each parameter is encoded using
a tag-length-value (TLV) triple.
BGP4 Update Message
The UPDATE messages are used to transfer routing information between BGP peers. The information in the
UPDATE packet can be used to construct a graph describing the relationships of the various Autonomous
Systems.
An UPDATE message may simultaneously advertise a feasible route and withdraw multiple unfeasible routes
from service. Hence, the UPDATE message consists of two parts:
1. The list of unfeasible routes that are being withdrawn.
2. The feasible route to advertise.
+-----------------------------------------------------+
|   Unfeasible Routes Length (2 octets)               |
+-----------------------------------------------------+
|   Withdrawn Routes (variable)                       |
+-----------------------------------------------------+
|   Total Path Attribute Length (2 octets)            |
+-----------------------------------------------------+
|   Path Attributes (variable)                        |
+-----------------------------------------------------+
|   Network Layer Reachability Information (variable) |
+-----------------------------------------------------+
• The Unfeasible Routes Length field indicates the total length of the Withdrawn Routes
field counted in bytes. The value 0 indicates that no routes are being withdrawn.
• The Withdrawn Routes field contains a list of IPv4 address prefixes that are being withdrawn
from service. Each IPv4 address prefix is encoded as a 2-tuple of the form (length, prefix) where
the length indicates the prefix length and prefix contains the IPv4 address prefix bits padded to
the next byte boundary.
• The Total Path Attribute Length field indicates the total length of the Path Attributes
field counted in bytes.
• The Path Attributes field contains a list of path attributes. Each attribute is encoded using a
tag-length-value (TLV) triple.
Path attributes convey information such as the origin of the path information (ORIGIN), the sequence
of AS path segments (AS_PATH), the IPv4 address of the border router that should be used as the next
hop (NEXT_HOP), or the local preference assigned by a BGP4 speaker (LOCAL_PREF).
• The Network Layer Reachability Information field contains a list of IPv4 address prefixes
that are reachable via the advertised route. Each IPv4 address prefix is encoded as a 2-tuple
of the form (length, prefix) where the length indicates the prefix length in bits and prefix contains
the IPv4 address prefix bits padded to the next byte boundary.
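The (length, prefix) encoding used for withdrawn routes and reachability information can be illustrated as follows. This is a sketch under the assumption that the length is a single byte holding the prefix length in bits, followed by just enough bytes to hold the prefix; the function names are hypothetical.

```python
import socket

def encode_prefix(address: str, length: int) -> bytes:
    """Encode an IPv4 prefix as one length byte plus the minimal number of prefix bytes."""
    nbytes = (length + 7) // 8                 # round up to the next byte boundary
    return bytes([length]) + socket.inet_aton(address)[:nbytes]

def decode_prefixes(data: bytes):
    """Decode a sequence of (length, prefix) tuples into (address, length) pairs."""
    prefixes = []
    offset = 0
    while offset < len(data):
        length = data[offset]
        nbytes = (length + 7) // 8
        raw = data[offset + 1:offset + 1 + nbytes]
        address = socket.inet_ntoa(raw + bytes(4 - nbytes))   # pad with zero bytes
        prefixes.append((address, length))
        offset += 1 + nbytes
    return prefixes

encoded = encode_prefix("10.1.0.0", 16) + encode_prefix("192.168.4.0", 22)
print(encoded.hex(), decode_prefixes(encoded))
```

A /16 prefix therefore occupies three bytes on the wire (one length byte and two prefix bytes), while a /22 prefix occupies four.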
BGP4 Notification Message
BGP4 supports a NOTIFICATION message type used for control or when an error occurs. The transport
connection is closed immediately after sending a NOTIFICATION message.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Error code    | Error subcode |             Data              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Error code field and the Error subcode field contain one of the following error codes:
Code  Description                  Subcode  Description
1     Message Header Error         1        Connection Not Synchronized
1     Message Header Error         2        Bad Message Length
1     Message Header Error         3        Bad Message Type
2     OPEN Message Error           1        Unsupported Version Number
2     OPEN Message Error           2        Bad Peer AS
2     OPEN Message Error           3        Bad BGP Identifier
2     OPEN Message Error           4        Unsupported Optional Parameter
2     OPEN Message Error           5        Authentication Failure
2     OPEN Message Error           6        Unacceptable Hold Time
3     UPDATE Message Error         1        Malformed Attribute List
3     UPDATE Message Error         2        Unrecognized Well-known Attribute
3     UPDATE Message Error         3        Missing Well-known Attribute
3     UPDATE Message Error         4        Attribute Flags Error
3     UPDATE Message Error         5        Attribute Length Error
3     UPDATE Message Error         6        Invalid ORIGIN Attribute
3     UPDATE Message Error         7        AS Routing Loop
3     UPDATE Message Error         8        Invalid NEXT_HOP Attribute
3     UPDATE Message Error         9        Optional Attribute Error
3     UPDATE Message Error         10       Invalid Network Field
3     UPDATE Message Error         11       Malformed AS_PATH
4     Hold Timer Expired
5     Finite State Machine Error
6     Cease
Table 3.3: BGP4 error codes and error subcodes
BGP4 Keep Alive Message
BGP4 peers periodically exchange KEEPALIVE messages. A KEEPALIVE message consists of the standard
BGP4 header with no additional data. The KEEPALIVE messages are needed to verify that shared state
information is still present. If a BGP4 peer does not receive a message within the Hold Time, then the peer
will assume that there is a communication problem and tear down the connection.
Chapter 4
Internet Transport Layer
The transport layer is responsible for providing suitable transport services to application protocols. Some application protocols require a stream-based connection while others prefer a reliable datagram service and yet
others are happy with a lightweight unreliable datagram service.
[Figure: SMTP, HTTP and FTP on top of the Transport Layer (endpoints: IP Address + Port Number), which sits
on top of the IP Layer (endpoints: IP Address)]
Figure 4.1: Transport layer multiplexing/demultiplexing using port numbers
• IP addresses are network layer endpoints and identify interfaces on nodes (hosts or routers) (see
footnote 1). Network addresses have node-to-node significance.
• Transport layer endpoints identify communicating application processes and are represented in the
Internet by a tuple consisting of an IP address and a 16-bit port number. Transport addresses have
end-to-end significance.
• The number space for port numbers is divided into a port number range that can be used freely and a port
number range which is managed by the Internet Assigned Numbers Authority (IANA). Well-known
port numbers for standardized or frequently used protocols can be registered with IANA.
• Port numbers allow packets to be multiplexed and demultiplexed at the transport layer as shown in
Figure 4.1.
There are currently four important transport protocols in the Internet:
1. The User Datagram Protocol (UDP) provides a simple unreliable best-effort datagram service.
2. The Transmission Control Protocol (TCP) provides a bidirectional, connection-oriented and reliable
data stream.
3. The Stream Control Transmission Protocol (SCTP) provides a reliable transport service supporting sequenced delivery of messages within multiple streams. SCTP maintains application protocol message
boundaries (application protocol framing) and was designed to support signaling protocols.
4. The Real-Time Transport Protocol (RTP) provides a transport service for real-time multi-media applications where different data streams have to be synchronized. RTP is often implemented on top of
UDP (and is thus, in terms of layering, not a pure transport layer protocol).
Footnote 1: It is worth noting that IP addresses are typically used to identify (interfaces of) nodes as well as their
location in the network. This dual role of IP addresses becomes interesting in the context of mobile devices.
4.1 Pseudo Header
Many Internet transport protocols contain an Internet checksum which is computed over the transport layer
message and a so-called pseudo header which contains some selected and immutable fields of the IP header.
The IPv4 pseudo header consists of the IPv4 source and destination address plus the protocol number and the
length of the transport layer message [47, 48].
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Source Address                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Destination Address                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  unused (0)   |   Protocol    |            Length             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The IPv6 pseudo header consists of the IPv6 source and destination address, the length of the transport layer
message and the next header field value which identifies the transport protocol.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                         Source Address                        +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                      Destination Address                      +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Upper-Layer Packet Length                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      zero                     |  Next Header  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
4.2 User Datagram Protocol (UDP)
The User Datagram Protocol (UDP) is defined in RFC 768 [47] and provides a simple unreliable best-effort
datagram service. The UDP protocol header basically extends the IP header with the source and destination
port numbers and a checksum. UDP packets are identified by the value 17 in the IPv4 Protocol field or
the IPv6 Next Header field. The UDP header has the following format:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Length             |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Source Port field contains the port number used by the sending application layer process.
• The Destination Port field contains the port number used by the receiving application layer
process.
• The Length field contains the length of the UDP datagram including the UDP header counted in
bytes.
• The Checksum field contains the Internet checksum computed over the pseudo header, the UDP
header and the payload contained in the UDP packet.
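To make the checksum computation concrete, the following Python sketch builds a UDP datagram for IPv4 and computes the Internet checksum (the ones' complement of the ones' complement sum of all 16-bit words) over the pseudo header, the UDP header and the payload. The function names and the example addresses are assumptions made for this illustration; IPv6 is not covered.

```python
import socket
import struct

def internet_checksum(data: bytes) -> int:
    """Ones' complement of the ones' complement sum of all 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                           # pad to a multiple of 16 bits
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:                            # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def ipv4_pseudo_header(src: str, dst: str, protocol: int, length: int) -> bytes:
    """Source address, destination address, zero byte, protocol, length."""
    return (socket.inet_aton(src) + socket.inet_aton(dst)
            + struct.pack("!BBH", 0, protocol, length))

def udp_datagram(src: str, dst: str, src_port: int, dst_port: int,
                 payload: bytes) -> bytes:
    length = 8 + len(payload)                     # UDP header is 8 bytes
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    pseudo = ipv4_pseudo_header(src, dst, 17, length)
    checksum = internet_checksum(pseudo + header + payload)
    if checksum == 0:
        checksum = 0xFFFF                         # zero is reserved for "no checksum"
    return struct.pack("!HHHH", src_port, dst_port, length, checksum) + payload

datagram = udp_datagram("192.0.2.1", "192.0.2.2", 12345, 53, b"hello")
pseudo = ipv4_pseudo_header("192.0.2.1", "192.0.2.2", 17, len(datagram))
# A receiver recomputes the checksum including the transmitted checksum field;
# a result of 0 indicates that no error was detected.
print(internet_checksum(pseudo + datagram))
```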
4.3 Transmission Control Protocol (TCP)
The Transmission Control Protocol (TCP) is defined in RFC 793 [48] and provides a bidirectional, connection-oriented and reliable data stream over an unreliable connection-less network protocol. Applications exchange
an unstructured byte stream and the TCP connection can be used in a bidirectional and a unidirectional
mode. TCP provides end-to-end flow control using a windowing technique with adaptive timeouts and an
automatic slow-down in congestion situations.
The data stream provided by an application is split into so called segments for transmission. Every data
segment is prefixed with a TCP header before it is sent as the payload of an IP packet. The TCP header has
the following format:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Acknowledgment Number                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Offset| Reserved  |   Flags   |            Window             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |         Urgent Pointer        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Source Port field contains the port number used by the sending application layer process.
• The Destination Port field contains the port number used by the receiving application layer
process.
• The Sequence Number field contains, after connection establishment, the sequence number of the
first data byte in the segment. During connection establishment, this field is used to establish the initial
sequence number.
• The Acknowledgment Number field contains the next sequence number which the sender of the
acknowledgement expects.
• The Offset field contains the length of the TCP header including any options, counted in 32-bit
words.
• The Flags field contains a set of binary flags:
– URG: Indicates that the Urgent Pointer field is significant.
– ACK: Indicates that the Acknowledgment Number field is significant.
– PSH: Data should be pushed to the application as quickly as possible.
– RST: Reset of the connection.
– SYN: Synchronization of sequence numbers.
– FIN: No more data from the sender.
• The Window field indicates the number of data bytes which the sender of the segment is willing to
receive.
• The Checksum field contains the Internet checksum computed over the pseudo header, the TCP
header and the data contained in the TCP segment.
• The Urgent Pointer field points, relative to the current sequence number, to important data if the
URG flag is set.
• The Options field can contain additional options.
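The fixed part of the TCP header can be decoded with a few lines of Python. This is a sketch for illustration only: the function name and the dictionary keys are hypothetical, the checksum is not verified, and options are returned as raw bytes without further parsing.

```python
import struct

# Ports (16 bits each), sequence and acknowledgment numbers (32 bits each),
# offset/reserved/flags (16 bits), window, checksum, urgent pointer (16 bits each).
TCP_HEADER = struct.Struct("!HHIIHHHH")
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]   # bit 0 up to bit 5

def parse_tcp_header(segment: bytes) -> dict:
    (src_port, dst_port, seq, ack, offset_flags,
     window, checksum, urgent) = TCP_HEADER.unpack_from(segment)
    offset = offset_flags >> 12                   # header length in 32-bit words
    flags = {name for bit, name in enumerate(FLAG_NAMES)
             if offset_flags & (1 << bit)}
    header_length = offset * 4
    return {
        "src_port": src_port, "dst_port": dst_port,
        "seq": seq, "ack": ack,
        "header_length": header_length, "flags": flags,
        "window": window, "checksum": checksum, "urgent": urgent,
        "options": segment[TCP_HEADER.size:header_length],
        "data": segment[header_length:],
    }

# Hypothetical SYN segment without options: offset 5 (20 bytes), SYN flag set.
syn = TCP_HEADER.pack(1234, 80, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
print(parse_tcp_header(syn)["flags"])
```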
4.3.1 Connection Establishment
Two communicating TCP protocol engines have to agree on a number of parameters. The connection establishment procedure establishes these parameters using a three-way handshake protocol. The handshake
protocol guarantees correct connection establishment, even if TCP packets are lost or duplicated. In the
normal case (no packet loss), the three-way handshake is performed as shown in Figure 4.2.
Active Open                            Passive Open
     |                                      |
     |--------------- SYN x --------------->|
     |                                      |
     |<--------- ACK x+1, SYN y ------------|
     |                                      |
     |-------------- ACK y+1 -------------->|
     |                                      |
Figure 4.2: TCP three-way connection establishment handshake
• One of the TCP protocol engines first waits passively for incoming connections (passive open).
• The other TCP protocol engine actively initiates the connection establishment procedure (active open).
• The first TCP packet contains the initial sequence number in a SYN packet. The initial sequence
number is determined by a counter which is incremented roughly every 4 microseconds. This guarantees
that the initial sequence number is not reused as long as old packets may still exist in the network.
• The passive TCP engine stores the received sequence number and sends its own randomly created
initial sequence number, at the same time acknowledging the received SYN packet.
• The active TCP engine stores the received sequence number and acknowledges the receipt of this
sequence number.
4.3.2 Connection Tear-down
TCP initially provides a bidirectional data stream after completing the connection establishment procedure.
It is possible to turn the bidirectional connection into a unidirectional connection by closing one half of the
connection. The TCP connection itself is terminated when both unidirectional connections have been closed.
In the normal case (no packet loss), the connection tear-down is performed as shown in Figure 4.3.
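Active Open                            Passive Open
     |                                      |
     |--------------- FIN x --------------->|
     |                                      |
     |<-------------- ACK x+1 --------------|
     |                                      |
     |<--------- ACK x+1, FIN y ------------|
     |                                      |
     |-------------- ACK y+1 -------------->|
     |                                      |
Figure 4.3: TCP connection teardown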
• The connection tear-down procedure is started by a TCP protocol engine by setting the FIN flag.
• The receiver usually first acknowledges the receipt of the FIN packet.
• The receiving protocol engine then informs the application about the tear-down of the first half of the
connection.
• Once the application indicates that it wants to close the other half of the connection, another TCP
packet is transmitted into the other direction with a FIN flag set.
• The receiver of the second FIN packet acknowledges the receipt of the second FIN packet and the
connection is closed.
In cases where a connection between two TCP engines is interrupted (e.g., a cable breaks or a node is turned
off), the TCP specification requires a quiet time of 120 seconds (maximum segment lifetime, MSL) before new
TCP connections can be established. The quiet time is motivated by the time needed to ensure that packets
belonging to the broken TCP connection have disappeared from the network.
4.3.3 State Machine
The various transitions possible during connection establishment and tear-down are best described by a finite
state machine as shown below.
                             +---------+ ---------\      active OPEN
                             |  CLOSED |            \    -----------
                             +---------+<---------\   \   create TCB
                               |     ^              \   \  snd SYN
                  passive OPEN |     |   CLOSE        \   \
                  ------------ |     | ----------       \   \
                   create TCB  |     | delete TCB         \   \
                               V     |                      \   \
                             +---------+            CLOSE    |    \
                             |  LISTEN |          ---------- |     |
                             +---------+          delete TCB |     |
                  rcv SYN      |     |     SEND              |     |
                 -----------   |     |    -------            |     V
+---------+      snd SYN,ACK  /       \   snd SYN          +---------+
|         |<-----------------           ------------------>|         |
|   SYN   |                    rcv SYN                     |   SYN   |
|   RCVD  |<-----------------------------------------------|   SENT  |
|         |                  snd SYN, ACK                  |         |
|         |------------------           -------------------|         |
+---------+   rcv ACK of SYN  \       /  rcv SYN,ACK       +---------+
  |           --------------   |     |   -----------
  |                  x         |     |     snd ACK
  |                            V     V
  |  CLOSE                   +---------+
  | -------                  |  ESTAB  |
  | snd FIN                  +---------+
  |                   CLOSE    |     |    rcv FIN
  V                  -------   |     |    -------
+---------+          snd FIN  /       \   snd ACK          +---------+
|  FIN    |<-----------------           ------------------>|  CLOSE  |
| WAIT-1  |------------------                              |   WAIT  |
+---------+          rcv FIN  \                            +---------+
  | rcv ACK of FIN   -------   |                            CLOSE  |
  | --------------   snd ACK   |                           ------- |
  V        x                   V                           snd FIN V
+---------+                  +---------+                   +---------+
|FINWAIT-2|                  | CLOSING |                   | LAST-ACK|
+---------+                  +---------+                   +---------+
  |                rcv ACK of FIN |                 rcv ACK of FIN |
  |  rcv FIN       -------------- |    Timeout=2MSL -------------- |
  |  -------              x       V    ------------        x       V
   \ snd ACK                 +---------+delete TCB         +---------+
    ------------------------>|TIME WAIT|------------------>| CLOSED  |
                             +---------+                   +---------+
The TCP state machine has the states shown in Table 4.1.
State         Description
CLOSED        Initial and final state
LISTEN        Wait for incoming connection requests (passive open)
SYN-RCVD      Received connection request (passive open)
SYN-SENT      Initiated connection establishment (active open)
ESTABLISHED   Connection is established and operational
FIN-WAIT-1    Started connection tear-down procedure
FIN-WAIT-2    Waiting for connection tear-down from remote end
TIME-WAIT     Waiting for remote engine to receive tear-down acknowledgement
CLOSING       Both engines close simultaneously
CLOSE-WAIT    Remote engine started connection tear-down procedure
LAST-ACK      Wait for last acknowledgement or until all segments have disappeared
Table 4.1: TCP protocol engine states
4.3.4 Flow Control
TCP uses a windowing approach to implement flow control. During connection establishment, both TCP engines advertise their buffer sizes. The available space left in the receive buffer is advertised as part of the acknowledgements.
Senders must not send more data than the advertised window allows, in order to protect the receiver’s buffers. The only exception is the window size 0. If
the window has reached a size of 0 bytes, the sender may still send data in the following two cases:
1. The sending application delivers urgent data that should be transmitted and delivered to the remote application as
fast as possible.
2. The sender may send a 1-byte segment in order to make the receiver reannounce the next byte
expected and the current window size. This is useful to protect against deadlocks that can otherwise occur if a
window update is lost.
The following illustrative example shown in Figure 4.4 is taken from [2].
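Sender                                              Receiver (4K buffer)
                                                    buffer empty
write(2K)   ------- 2K | SEQ = 0 ----------->       buffer holds 2K
            <------ ACK = 2048 Win = 2048 ---
write(2K)   ------- 2K | SEQ = 2048 -------->       buffer full (4K)
            <------ ACK = 4096 Win = 0 ------
                                                    read(2K): buffer holds 2K
            <------ ACK = 4096 Win = 2048 ---
Figure 4.4: TCP flow control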
Suppose the receiver has a 4096 byte buffer and the sender has 2048 bytes ready to send. The sender will immediately
transmit a 2048 byte segment (assuming that the path MTU is large enough). The receiver now fills half of the buffer and
announces a new window size of 2048 bytes in the acknowledgement. Once the application has 2048 more bytes to send,
another 2048 byte segment is transmitted. This now fully fills the receiver’s buffer which leads to an announcement of
a window of size 0 in the following acknowledgement. Once the receiver’s application process consumes data, another
acknowledgement will be created which informs the sender of a new window size.
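The example can be replayed with a small Python model of the receiver side. This is a deliberately simplified sketch: the Receiver class and its methods are hypothetical, segments are assumed to arrive in order, and every call immediately yields an (ACK, Win) pair.

```python
class Receiver:
    """Toy model of a TCP receive buffer that advertises its free space."""

    def __init__(self, buffer_size: int):
        self.buffer_size = buffer_size
        self.buffered = 0          # bytes waiting for the application
        self.next_expected = 0     # next sequence number expected

    def window(self) -> int:
        return self.buffer_size - self.buffered

    def receive(self, seq: int, length: int):
        """Accept an in-order segment that fits into the window; return (ACK, Win)."""
        if seq == self.next_expected and length <= self.window():
            self.buffered += length
            self.next_expected += length
        return self.next_expected, self.window()

    def read(self, length: int):
        """The application consumes data; return the window update (ACK, Win)."""
        self.buffered -= min(length, self.buffered)
        return self.next_expected, self.window()

receiver = Receiver(4096)
print(receiver.receive(0, 2048))      # first 2K segment
print(receiver.receive(2048, 2048))   # second 2K segment fills the buffer
print(receiver.read(2048))            # application reads 2K
```

The three printed pairs correspond to the three acknowledgements of the example: the window shrinks to 2048, then to 0, and opens again to 2048 once the application has read data.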
The TCP specification does not require that acknowledgements are created immediately for each received segment. This
allows for optimizations where a receiver might choose to send just a single acknowledgement for several segments that
have been received quickly in sequence. Furthermore, the receiving TCP engine might choose to delay the acknowledgement so that it can be piggybacked on other data sent from the receiver to the sender.
Nagle’s Algorithm
The original TCP behaves rather inefficiently in situations where an application sends a stream of very small (one byte)
payloads. In the extreme case, the sender sends a segment containing a one-byte payload. The receiver responds with
an acknowledgement for that single byte. Once the received byte has been copied to the application process, another
acknowledgement is sent to advertise a new window size. The sending application may now have another byte to send
and the process repeats.
Nagle suggested solving this problem by introducing the following rule: When data comes into the sender one byte at
a time, just send the first byte and buffer all the rest until the byte in flight has been acknowledged. This algorithm
provides noticeable improvements especially for interactive traffic where a quickly typing user is connected over a rather
slow network.
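The rule can be sketched as a small sender model. The NagleSender class is a hypothetical illustration of the rule as stated above; it is a simplification, since real implementations also send as soon as a full-sized segment has accumulated, which this sketch omits.

```python
class NagleSender:
    """Toy sender: at most one small segment is in flight at any time."""

    def __init__(self):
        self.buffer = b""        # data written but not yet sent
        self.in_flight = False   # True while a segment awaits its acknowledgement
        self.sent = []           # segments handed to the network, in order

    def write(self, data: bytes) -> None:
        """The application writes data; send it only if nothing is in flight."""
        self.buffer += data
        self._try_send()

    def ack(self) -> None:
        """The segment in flight was acknowledged; flush whatever has accumulated."""
        self.in_flight = False
        self._try_send()

    def _try_send(self) -> None:
        if self.buffer and not self.in_flight:
            self.sent.append(self.buffer)
            self.buffer = b""
            self.in_flight = True

sender = NagleSender()
for char in b"hello":
    sender.write(bytes([char]))   # the user types one character at a time
sender.ack()
print(sender.sent)
```

Five one-byte writes thus result in only two segments: the first byte, and the four bytes that accumulated while it was in flight.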
Clark’s Algorithm
Another related problem is known as the silly window syndrome. This problem deals with applications on the receiving
side that read the data one byte at a time from the receiver’s buffer. The original TCP implementations immediately
announced a window of one byte when the application removed a byte from the receive buffer. This acknowledgement
then causes the transmission of another TCP segment which contains again just one byte of data.
Clark suggested solving this problem by preventing the receiver from sending a window update of 1 byte. Specifically,
the receiver should not send a window update until it can handle the maximum segment size it advertised when the
connection was established or until its buffer is half empty, whichever is smaller.
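Clark's rule boils down to a single comparison. The sketch below assumes that the receiver knows its free buffer space, its total buffer size and the maximum segment size it advertised; the function and parameter names are chosen for this example only.

```python
def should_send_window_update(free_space: int, mss: int, buffer_size: int) -> bool:
    """Return True once the free space is worth announcing to the sender."""
    # Wait for one full segment or half of the buffer, whichever is smaller.
    threshold = min(mss, buffer_size // 2)
    return free_space >= threshold


# With a 4096 byte buffer and an MSS of 1460 bytes, a single free byte
# is not announced, but 1460 free bytes are.
print(should_send_window_update(1, 1460, 4096))
print(should_send_window_update(1460, 1460, 4096))
```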
4.3.5 Congestion Control
A detailed discussion of the congestion control mechanism used by TCP can be found in RFC 2581 [49] while RFC 3390
[50] increases the size of the initial window. The following text is a rather short summary of these RFCs.
TCP’s congestion control introduces the concept of a congestion window (cwnd) which defines how much data can be
in transit. The congestion window is maintained by a TCP sender in addition to the flow control receiver window (rwnd)
which is advertised by the receiver. The sender uses these two windows to limit the data that has been sent to the network and
not yet acknowledged (flight size) to the minimum of the receiver window and the congestion window:
$$\mathit{flightsize} \le \min(\mathit{cwnd}, \mathit{rwnd})$$
The key problem to be solved is the dynamic estimation of the congestion window. The solution adopted by TCP
assumes that lost segments are indications of congestion. While this is true in most wired networks, this assumption
does not work that well in wireless networks where the loss rate is much higher. Recent work also introduced explicit
congestion notifications which can be used by a router in the network to indicate congestion without having to drop
packets.
A TCP connection usually has different phases where different congestion control techniques should be used:
• After connection establishment, no suitable value for the congestion window is known. The solution is to define
an initial window size (IW) and to start probing the network for the real congestion window using the slow start
algorithm.
• Once a certain threshold, the so-called slow start threshold (ssthresh) for the congestion window, has been
crossed, the connection enters the congestion avoidance phase in which the congestion window increases linearly.
• If a timeout occurs or congestion is signalled by other means, the slow start threshold ssthresh is reduced and
the congestion window is set to the so called loss window (LW) which is one full-sized segment. The sender now
switches back to the slow start algorithm until ssthresh is crossed and congestion avoidance takes over.
• After a long period of idle time, the congestion window cwnd is usually not accurate anymore. Hence, the value of
the congestion window must be set to the restart window (RW), which is typically the same as the initial window
(IW), and the slow start algorithm is executed.
The initial window (IW) is usually initialized using the following formula:
$$\mathit{IW} = \min(4 \cdot \mathit{SMSS}, \max(2 \cdot \mathit{SMSS}, 4380\ \text{bytes}))$$
In this formula, SMSS is the sender maximum segment size, the size of the largest segment that the sender can transmit.
The size does not include the TCP/IP headers and options.
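A few sample values make the effect of the formula visible. The function below is a direct transcription of the formula; the function name and the chosen segment sizes are picked for illustration only.

```python
def initial_window(smss: int) -> int:
    """Initial window in bytes for a given sender maximum segment size."""
    return min(4 * smss, max(2 * smss, 4380))


for smss in (536, 1460, 4000):
    print(smss, initial_window(smss))
```

For small segments the initial window is four segments (2144 bytes for an SMSS of 536 bytes), for an SMSS of 1460 bytes it is 4380 bytes (three segments), and for large segments it is two segments (8000 bytes for an SMSS of 4000 bytes).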
During slow start, the congestion window cwnd increases by at most SMSS bytes for every acknowledgement received
that acknowledges new data. Note that this leads to an exponential increase of cwnd per round-trip time, since every acknowledged segment of the current window enlarges the window by up to one segment. Slow start ends when cwnd exceeds ssthresh or when congestion is observed. The initial value
of ssthresh may be arbitrarily high. Some implementations use the size of the advertised window.
During congestion avoidance, cwnd is incremented by one full-sized segment per round-trip time (RTT). Congestion
avoidance continues until congestion is detected. One formula commonly used to update cwnd during congestion avoidance is given by the following equation:
$$\mathit{cwnd} = \mathit{cwnd} + \frac{\mathit{SMSS} \cdot \mathit{SMSS}}{\mathit{cwnd}}$$
This adjustment is executed on every incoming non-duplicate ACK. The equation provides an acceptable approximation
to the underlying principle of increasing cwnd by one full-sized segment per RTT.
When congestion is noticed during slow start or congestion avoidance (the retransmission timer expires), the slow
start threshold ssthresh is updated as follows:
$$\mathit{ssthresh} = \max(\mathit{flightsize} / 2, 2 \cdot \mathit{SMSS})$$
The flight size is the amount of outstanding data in the network. The impact of slow start and congestion avoidance is
summarized in Figure 4.5.
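The interplay of the update rules can be reproduced with a few lines of code. The following is a simplified sketch, not an implementation of RFC 2581: it assumes an SMSS of 1000 bytes, an initial window of one segment, one acknowledgement per full-sized segment, a receiver window that never limits the sender, and a single timeout in a round chosen by the caller.

```python
SMSS = 1000  # assumed segment size in bytes, chosen to give round numbers


def on_ack(cwnd: int, ssthresh: int) -> int:
    """New cwnd after one non-duplicate acknowledgement."""
    if cwnd < ssthresh:
        return cwnd + SMSS                      # slow start
    return cwnd + max(1, SMSS * SMSS // cwnd)   # congestion avoidance


def on_timeout(flight_size: int) -> tuple:
    """Return (cwnd, ssthresh) after the retransmission timer expired."""
    ssthresh = max(flight_size // 2, 2 * SMSS)
    return SMSS, ssthresh   # cwnd falls back to a loss window of one segment


def simulate(rounds: int, timeout_round: int, ssthresh: int) -> list:
    """Return cwnd (in bytes) at the start of each round-trip time."""
    cwnd = SMSS
    history = []
    for r in range(rounds):
        history.append(cwnd)
        if r == timeout_round:
            cwnd, ssthresh = on_timeout(flight_size=cwnd)
            continue
        for _ in range(cwnd // SMSS):   # one ACK per full-sized segment in flight
            cwnd = on_ack(cwnd, ssthresh)
    return history


print(simulate(rounds=8, timeout_round=5, ssthresh=8 * SMSS))
```

The output starts with 1000, 2000, 4000, 8000 (slow start doubles the window every round), then grows by roughly one segment per round (congestion avoidance), and drops back to 1000 after the timeout.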
Figure 4.5: TCP slow start and congestion avoidance
(Plot not reproduced: congestion window, axis ticks 0 to 44, against transmission number, axis ticks 0 to 28, with the ssthresh level and a timeout marked.)
Fast Retransmit / Fast Recovery
To reduce the time for retransmissions, TCP receivers should send an immediate duplicate acknowledgement when
an out-of-order segment arrives. The purpose of this acknowledgement is to inform the sender that a segment was
received out-of-order and which sequence number is expected. In addition, a TCP receiver should send an immediate
acknowledgement when the incoming segment fills in all or part of a gap in the sequence space.
TCP senders should use the fast retransmit algorithm to detect and repair loss. The fast retransmit algorithm uses the arrival of three duplicate acknowledgements (four identical acknowledgements without the arrival of any other intervening
packets) as an indication that a segment has been lost. After receiving three duplicate acknowledgements, TCP performs
a retransmission of what appears to be the missing segment, without waiting for the retransmission timer to expire.
The fast recovery algorithm controls how the congestion window and the slow start threshold are updated when the fast
retransmit algorithm is used. The basic idea is to not exercise the normal congestion reaction with a full slow start since
acknowledgements are still flowing. For details on how fast retransmit and fast recovery are implemented, see Section 3.2
in RFC 2581 [49].
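The detection part of fast retransmit is a simple counter. The sketch below only shows how three duplicate acknowledgements are recognized in a stream of acknowledgement numbers; it leaves out the window adjustments of fast recovery, and the function name is chosen for this example.

```python
def fast_retransmit_points(acks):
    """Return the acknowledgement numbers that trigger a fast retransmit."""
    last_ack = None
    duplicates = 0
    triggered = []
    for ack in acks:
        if ack == last_ack:
            duplicates += 1
            if duplicates == 3:         # third duplicate = fourth identical ACK
                triggered.append(ack)
        else:
            last_ack = ack              # new data acknowledged, start over
            duplicates = 0
    return triggered


# The segment starting at sequence number 2000 is lost; later segments
# arrive out of order and each one produces a duplicate acknowledgement.
print(fast_retransmit_points([1000, 2000, 2000, 2000, 2000, 3000]))
```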
4.3.6 Retransmission Timer
The retransmission timer controls when a segment is resent if no acknowledgement has been received. Setting the
retransmission timer to a reasonable value is rather difficult in the context of TCP since the round-trip time varies and the
mean and the variance of the round-trip time distribution can change rapidly within a few seconds as congestion builds
up or is resolved.
The solution adopted in TCP is to measure the round-trip time constantly and to adjust the timeout interval. For each
connection, TCP maintains an estimation of the current round-trip time in a variable called RTT. If an acknowledgement
is received for a segment before the associated retransmission timer expires, the estimation of the round-trip time is
updated as follows:
$$\mathit{RTT} = \alpha \cdot \mathit{RTT} + (1 - \alpha) \cdot M$$
The parameter α is a smoothing factor and typically set to α = 7/8. M is the measured round-trip time, that is the time
that has passed between sending the segment and receiving the corresponding acknowledgement.
In order to determine a good value for the retransmission timeout, the variance of the round-trip time distribution also has to
be taken into account. TCP uses a cheap estimator D for the standard deviation which is very efficient to compute:
$$D = \alpha \cdot D + (1 - \alpha) \cdot |\mathit{RTT} - M|$$
The parameter α in this equation is again a smoothing factor; its value can differ from the smoothing factor used to estimate RTT.
With these estimations of the average round-trip time and the standard deviation, the retransmission timeout RTO is
usually set as follows:
$$\mathit{RTO} = \mathit{RTT} + 4 \cdot D$$
The factor 4 was chosen more or less empirically. According to some studies, less than one percent of all packets
arrive more than four standard deviations late.
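The three update rules fit into a small class. This is a sketch under assumptions that the text does not fix: the deviation is computed against the RTT estimate from before the new sample is folded in, the initial deviation is set to half of the first sample, and both smoothing factors default to 7/8. The class and parameter names are chosen for this example.

```python
class RttEstimator:
    """Smoothed round-trip time and deviation, as in the formulas above."""

    def __init__(self, first_sample: float, alpha_rtt: float = 7 / 8,
                 alpha_dev: float = 7 / 8):
        self.alpha_rtt = alpha_rtt
        self.alpha_dev = alpha_dev
        self.rtt = first_sample        # smoothed round-trip time
        self.dev = first_sample / 2    # initial deviation (assumption)

    def update(self, m: float) -> float:
        """Fold in a new measurement m and return the new timeout."""
        self.dev = self.alpha_dev * self.dev + (1 - self.alpha_dev) * abs(self.rtt - m)
        self.rtt = self.alpha_rtt * self.rtt + (1 - self.alpha_rtt) * m
        return self.rto()

    def rto(self) -> float:
        return self.rtt + 4 * self.dev


estimator = RttEstimator(first_sample=100.0)   # times in milliseconds
for sample in (100.0, 100.0, 300.0, 100.0):
    print(round(estimator.update(sample), 1))
```

Stable measurements let the timeout shrink towards the smoothed round-trip time, while a single outlier (300 ms) immediately increases the deviation and therefore the timeout.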
Karn’s Algorithm
The dynamic estimation of the RTT has a problem if a timeout occurs and the segment is retransmitted. A subsequent acknowledgement might acknowledge the receipt of the first packet which contained that segment or any of the
retransmissions. Guessing wrong can seriously impact the RTT estimation.
Karn therefore suggested that the RTT estimation is not updated for any segments which were retransmitted. Furthermore, Karn suggested that the RTO is doubled on each failure until the segment gets through, which leads to an
exponential back-off for each consecutive attempt. These fixes are now known as Karn’s algorithm.
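Both parts of Karn's algorithm can be expressed in a few lines. The sketch below tracks a single outstanding segment; the class KarnTimer and its methods are hypothetical names for this example, and the collected samples would be fed into the RTT estimation described above.

```python
class KarnTimer:
    """Timeout handling for one outstanding segment."""

    def __init__(self, rto: float):
        self.rto = rto
        self.retransmitted = False
        self.samples = []    # round-trip times that may update the RTT estimate

    def on_timeout(self) -> float:
        """The timer expired: retransmit and double the timeout."""
        self.retransmitted = True
        self.rto *= 2
        return self.rto

    def on_ack(self, measured_rtt: float) -> None:
        """An acknowledgement arrived for the outstanding segment."""
        if not self.retransmitted:
            self.samples.append(measured_rtt)   # unambiguous measurement
        self.retransmitted = False              # the next segment starts fresh


timer = KarnTimer(rto=1.0)
print(timer.on_timeout(), timer.on_timeout())   # timeout doubles twice
timer.on_ack(5.0)                               # ambiguous, sample is ignored
timer.on_ack(0.3)                               # clean measurement, sample is kept
print(timer.samples)
```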
4.4 Stream Control Transmission Protocol (SCTP)
The Stream Control Transmission Protocol (SCTP) is defined in RFC 2960 [51]. A shorter introduction can be found in
RFC 3286 [52]. SCTP provides a reliable transport service, ensuring that data is transported without error and
delivered in sequence. SCTP is message oriented and preserves the boundaries of application layer messages (application
layer framing).
Data transfer between two SCTP hosts takes place in the context of an association. An association may contain multiple
data streams and each stream has the property of independently sequenced delivery. A message loss in one stream thus
does not affect the delivery in other streams. SCTP accomplishes multi-streaming by creating independence between
data transmission and data delivery.
Unlike TCP, SCTP allows SCTP endpoints to have multiple IP addresses. This multi-homing feature provides the benefit
of potentially greater survivability of an SCTP association in the presence of network failures.
An SCTP packet is composed of a common header and chunks. A chunk contains either control information or user data.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Common Header                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Chunk #1                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              ...                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Chunk #n                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Multiple chunks can be bundled into one SCTP packet up to the MTU size. If a user data message doesn’t fit into one
SCTP packet it can be fragmented into multiple chunks.
The SCTP common header has the following format:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Verification Tag                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Checksum                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Source Port field contains the port number used by the sending application layer process.
• The Destination Port field contains the port number used by the receiving application layer process.
• The Verification Tag field is used by the receiver to validate the sender of the SCTP packet. This field
protects the SCTP protocol against certain attacks.
• The Checksum field contains a 32-bit CRC checksum as specified in RFC 3309 [53].
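The fixed layout of the common header maps directly onto a binary format description. The following sketch uses Python's struct module to decode the twelve header bytes; the function name and the example values are made up, and the checksum is only extracted, not verified.

```python
import struct

# Two 16-bit ports followed by two 32-bit fields, all in network byte order.
SCTP_COMMON_HEADER = struct.Struct("!HHII")


def parse_common_header(packet: bytes) -> dict:
    """Decode the SCTP common header at the start of a packet."""
    src, dst, tag, checksum = SCTP_COMMON_HEADER.unpack_from(packet)
    return {
        "source_port": src,
        "destination_port": dst,
        "verification_tag": tag,
        "checksum": checksum,
    }


# Build an example header and decode it again.
raw = SCTP_COMMON_HEADER.pack(5000, 80, 0xCAFEBABE, 0)
print(SCTP_COMMON_HEADER.size, parse_common_header(raw))
```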
Payload is transmitted in so called data chunks. Data chunks have the following format:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Type = 0    |     Flags     |            Length             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              Transmission Sequence Number (TSN)               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Stream Identifier (S)     |  Stream Sequence Number (n)   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  Payload Protocol Identifier                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
\                                                               \
/                 User Data (seq n of Stream S)                 /
\                                                               \
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Type field indicates the chunk type. Data chunks use the type number 0.
• The Flags field contains a set of binary flags:
– U: The data chunk is unordered and there is no Stream Sequence Number assigned to the data chunk.
– B: Indicates the beginning of a fragment of a user message.
– E: Indicates the ending of a fragment (last fragment) of a user message.
• The Length field indicates the size of the chunk in bytes including the chunk header fields.
• The Transmission Sequence Number field contains the transmission sequence number which is also
used by the receiver to reassemble messages.
• The Stream Identifier identifies the stream to which the following data belongs.
• The Stream Sequence Number identifies the stream sequence number of the following user data within the
stream identified by the Stream Identifier.
• The Payload Protocol Identifier identifies the upper layer application protocol and is opaque from
the viewpoint of an SCTP protocol engine.
• The User Data is of variable length and contains the actual payload.
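The field list above can be turned into a small decoder. This is a sketch only: the bit positions of the U, B and E flags (0x04, 0x02, 0x01) are taken from RFC 2960 and are not stated in the text above, and the function name is chosen for this example.

```python
import struct

# Type (8), Flags (8), Length (16), TSN (32), Stream Identifier (16),
# Stream Sequence Number (16), Payload Protocol Identifier (32).
DATA_CHUNK_HEADER = struct.Struct("!BBHIHHI")


def parse_data_chunk(chunk: bytes) -> dict:
    """Decode one SCTP data chunk (header fields plus user data)."""
    ctype, flags, length, tsn, stream, seq, ppid = DATA_CHUNK_HEADER.unpack_from(chunk)
    if ctype != 0:
        raise ValueError("not a data chunk")
    return {
        "unordered": bool(flags & 0x04),   # U flag
        "beginning": bool(flags & 0x02),   # B flag
        "ending": bool(flags & 0x01),      # E flag
        "tsn": tsn,
        "stream": stream,
        "stream_seq": seq,
        "ppid": ppid,
        # Length covers the chunk header, so it marks the end of the user data.
        "user_data": chunk[DATA_CHUNK_HEADER.size:length],
    }


# An unfragmented message (B and E set) of five bytes on stream 1,
# followed by three padding bytes.
payload = b"hello"
raw = DATA_CHUNK_HEADER.pack(0, 0x03, DATA_CHUNK_HEADER.size + len(payload),
                             7, 1, 2, 0) + payload + b"\x00" * 3
print(parse_data_chunk(raw))
```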
4.5 Datagram Congestion Control Protocol (DCCP)
The Datagram Congestion Control Protocol (DCCP) is the latest addition to the set of Internet transport protocols. It
provides a congestion controlled, unreliable flow of datagrams suitable for use by applications such as streaming media.
DCCP is connection oriented and has connection setup, data exchange and teardown phases. However, in contrast to
TCP and SCTP, DCCP does not provide ordered delivery nor does it provide end-to-end flow control. DCCP supports
path MTU discovery and explicit congestion notifications (ECNs).
The congestion control mechanism used for a DCCP connection can be negotiated. DCCP therefore supports different
congestion control mechanisms which are identified by so-called congestion control identifiers (CCIDs). A TCP-like
congestion control mechanism is identified by CCID 2. A TCP-Friendly Rate Control (TFRC) mechanism is identified
by CCID 3.
DCCP uses nine different packet types, which are typically used in the following sequence:
1. A DCCP-Request is used by the client to initiate a connection with a server.
2. A DCCP-Response is sent by the server to indicate that it is willing to talk to a client.
3. A DCCP-Ack packet is sent by the client to acknowledge the DCCP-Response. Further DCCP-Ack packets
might be exchanged to negotiate options.
4. After the connection has been established, DCCP-Data, DCCP-Ack and DCCP-DataAck packets are exchanged to exchange payload data and acknowledgements.
5. The server sends a DCCP-CloseReq packet requesting to close the connection.
6. The client sends a DCCP-Close packet acknowledging the request to close the connection.
7. The server sends a DCCP-Reset packet to clear the connection state.
The connection teardown can also be started by the client. In this case, the client sends a DCCP-Close to close the
connection. The server responds with a DCCP-Reset packet to clear the connection state.
All DCCP packets use a common header:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |           Dest Port           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type  | CCval |                Sequence Number                |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data Offset  | # NDP | Cslen |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• The Source Port field contains the port number used by the sending application layer process.
• The Destination Port field contains the port number used by the receiving application layer process.
• The Type field indicates the type of the DCCP message.
• The CCval field contains data which might be used by a congestion control mechanism.
• The Sequence Number field contains a sequence number which counts the number of packets.
• The Data Offset field contains the offset from the start of the DCCP header to the start of the payload,
counted in 32-bit words.
• The NDP field contains the number of non-data packets sent in the sender’s packet sequence, modulo 16.
• The Cslen field specifies what parts of the packet are covered by the checksum field. The checksum always
covers at least the DCCP header, DCCP options, and a pseudoheader taken from the network-layer header.
• The Checksum field contains the Internet checksum computed over the pseudo header, the DCCP header, any
DCCP options and, depending on the Cslen value, over some of the payload.
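A decoder for this header needs some bit shifting because several fields are narrower than a byte. The sketch below assumes the field widths suggested by the header diagram in this section: 4 bits each for Type, CCval, NDP and Cslen, 24 bits for the sequence number, 8 bits for the data offset and 16 bits for the checksum. These widths are an assumption of this example, and other versions of the DCCP specification may arrange the header differently.

```python
import struct

# Ports (16 + 16), Type/CCval/Sequence Number as one 32-bit word,
# Data Offset (8), NDP/Cslen as one byte, Checksum (16).
DCCP_HEADER = struct.Struct("!HHIBBH")


def parse_dccp_header(packet: bytes) -> dict:
    """Decode the 12-byte header layout shown in this section."""
    src, dst, word, data_offset, ndp_cslen, checksum = DCCP_HEADER.unpack_from(packet)
    return {
        "source_port": src,
        "dest_port": dst,
        "type": word >> 28,                    # upper 4 bits
        "ccval": (word >> 24) & 0x0F,          # next 4 bits
        "sequence_number": word & 0xFFFFFF,    # lower 24 bits
        "data_offset_bytes": data_offset * 4,  # field counts 32-bit words
        "ndp": ndp_cslen >> 4,
        "cslen": ndp_cslen & 0x0F,
        "checksum": checksum,
    }


raw = DCCP_HEADER.pack(1234, 5001, (2 << 28) | (1 << 24) | 77, 3, (5 << 4) | 15, 0xBEEF)
print(parse_dccp_header(raw))
```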
Chapter 5
Internet Application Layer
The application layer of the Internet has little structure. Many protocols run directly on top of the transport protocols and it is
not unusual that the various protocols solve recurring problems (such as data encoding) in different ways. Although this
approach does not seem to be very efficient from a global perspective, it must be noted that the individual point solutions
have been very successful in practice.
There are many application layer protocols in use in the Internet. This chapter focusses on those protocols which either
realize certain core services (e.g., DNS) or which carry a significant portion of the traffic (e.g., HTTP, FTP, SMTP) or
which are interesting because of some of their special features.
5.1 Domain Name System (DNS)
The Internet uses bit sequences (IPv4/IPv6 addresses and port numbers) to address nodes and transport endpoints. These
bit sequences can be processed efficiently by machines but they are hard for humans to remember. The Domain Name
System (DNS) specified in RFC 1034 [54] and RFC 1035 [55] provides a global infrastructure to map human friendly
domain names into IP addresses and vice versa.
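Figure 5.1: Structure of Domain Name System (DNS) names
(Diagram not reproduced: a tree with a virtual root; the top level with the labels nl, de, edu, org, net, com and biz; the 2nd level with the label iu-bremen; the 3rd level with the label eecs; the 4th level with the label www.)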
• The Domain Name System provides a hierarchical name space with a virtual root. The administration of the name
space can be delegated along the paths starting from the virtual root.
• Name resolution is realized by so called DNS servers. A DNS server knows a part (a zone) of the global name
space and its position within the global name space. Note that some parts of the name space might be further
delegated to other name servers.
• Name resolution queries can in principle be sent to arbitrary DNS servers. However, it is good practice to use a
local DNS server as the primary DNS server. Non-local servers might be used as backup servers.
• Recursive queries cause the queried DNS server to contact other DNS servers as needed in order to obtain a
response to the query. This is convenient for the DNS client but requires more complexity in the DNS server.
The alternative is iterative queries, where the client may have to send a series of queries to several DNS servers
to retrieve the desired information.
• The original DNS protocol does not provide sufficient security. In particular, there is no guarantee that the
returned response is trustworthy. (Does a request to obtain an IP address for the name www.my-bank.com
really return an IP address of my bank?) Furthermore, if the DNS were secure, it could be used as
an infrastructure to distribute, for example, certificates used by other security mechanisms. Although there are
standards for secure DNS (RFC 2535 [56]), they are not as widely used as they should be.
• Since well-known DNS names became a trade item in recent years, there is quite some political debate around the
question of who is actually in charge of defining new top-level domain names. At the moment, this responsibility lies
in the hands of the Internet Corporation for Assigned Names and Numbers (ICANN). However, ICANN itself is
an organization that is not universally accepted.
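Figure 5.2: Recursive name resolution with two DNS servers
(Diagram not reproduced. Its labels are: Client Machine with Application, API, Resolver and Cache; DNS Server Machines with Primary Name Server and Encompassing Name Server, each with an RR DB and a Cache; the links between the resolver and the name servers are labelled DNS.)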
The global Domain Name System basically consists of three components:
1. The hierarchical name space and the so called Resource Records (RRs) that hold typed information for a given
name. There are several standardized resource record types and new resource record types can be defined in order
to store additional information in the domain name system.
2. A set of DNS servers which provide access to the information stored in resource records via the DNS protocol.
DNS servers usually also maintain some information in local caches in order to increase the overall efficiency of
the DNS. DNS servers usually have authoritative information about one or multiple local zones which they
are responsible for.
3. A set of resolver libraries, usually shipped as part of the operating system, which contain an implementation of
a DNS client and provide a programmatic API to resolve names to addresses and vice versa (see getaddrinfo()
and getnameinfo()). Application programs call the resolver system library functions to resolve DNS and
potentially other names.
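The resolver API mentioned in the last item is available in most programming languages. The following sketch uses Python's socket module, which wraps getaddrinfo() and getnameinfo(); the helper names resolve and reverse are chosen for this example, and the results for real host names depend on the local resolver configuration.

```python
import socket


def resolve(host: str, port: int) -> list:
    """Return the distinct addresses the resolver reports for host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the address string is the first element of sockaddr.
    return sorted({info[4][0] for info in infos})


def reverse(address: str, port: int = 0) -> str:
    """Return the name the resolver reports for an IP address."""
    name, _service = socket.getnameinfo((address, port), 0)
    return name


# A numeric address needs no DNS lookup and is returned unchanged.
print(resolve("127.0.0.1", 80))
```

Calling resolve() with a host name instead of a numeric address causes the resolver library to query the configured DNS servers (or other name services) on behalf of the application.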
5.1.1 Format of Domain Names
The format of domain names is defined in RFC 1034 [54]. The following rules apply for traditional domain names:
• The names (labels) on a certain level of the tree must be unique and may not exceed 63 bytes in length. The
character set for the labels is seven bit ASCII. Comparisons are done in a case-insensitive manner.
• The labels must begin with a letter and end with a letter or decimal digit. The characters between the first and last
character must be letters, digits or hyphens.
• The labels can be concatenated with dots to form paths within the name space. Absolute paths which end at the
virtual root node end with a trailing dot. All other paths which do not end with a trailing dot are relative paths.
• The overall length of a domain name is limited to 255 bytes.
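The rules above can be checked mechanically. The sketch below is a simplified validator: it applies the 255 byte limit to the textual form of the name rather than to its encoding in DNS messages, and the function name is chosen for this example.

```python
import re

# A label starts with a letter, ends with a letter or digit, and contains
# only letters, digits or hyphens in between.
LABEL_RE = re.compile(r"^[A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9])?$")


def is_valid_domain_name(name: str) -> bool:
    """Check a traditional domain name against the rules listed above."""
    if name.endswith("."):
        name = name[:-1]          # absolute name: ignore the trailing dot
    if not name or len(name) > 255:
        return False
    return all(len(label) <= 63 and LABEL_RE.match(label) is not None
               for label in name.split("."))


for candidate in ("www.eecs.iu-bremen.de.", "-bad.example", "a" * 64 + ".com"):
    print(candidate[:24], is_valid_domain_name(candidate))
```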
Type    Description
A       IPv4 address
AAAA    IPv6 address
CNAME   Alias for another name (canonical name)
HINFO   Identification of the CPU and the operating system (host info)
MX      List of mail servers (mail exchanger)
NS      Identification of an authoritative server for a domain
PTR     Pointer to another part of the name space
SOA     Start and parameters of a zone (start of zone of authority)
KEY     Public key associated with a name
SIG     Signature over the RRs associated with a name
Table 5.1: DNS resource record types
Recent efforts resulted in proposals for Internationalized Domain Names in Applications (IDNA) (RFC 3490 [57], RFC
3491 [58], RFC 3492 [59]). The basic idea is to support internationalized character sets within applications. However, for
backward compatibility reasons, internationalized character sets are encoded into seven bit ASCII representations (ASCII
Compatible Encoding, ACE). ACE labels are recognized by a so-called ACE prefix. The ACE prefix for IDNA is xn--. A
label which contains an encoded internationalized name might for example be the value xn--de-jg4avhby1noc0d.
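The conversion between internationalized names and their ACE form can be tried out with the idna codec that ships with Python. The example name bücher.example is chosen for illustration and does not appear in the RFCs cited above.

```python
def to_ace(name: str) -> bytes:
    """Encode an internationalized domain name into its ASCII form."""
    return name.encode("idna")


def from_ace(name: bytes) -> str:
    """Decode an ASCII Compatible Encoding back into Unicode."""
    return name.decode("idna")


ace = to_ace("bücher.example")
print(ace)             # the label with the umlaut gets the xn-- prefix
print(from_ace(ace))
```

Only labels that contain non-ASCII characters are rewritten; the label example passes through unchanged.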
5.1.2 Resource Records
Information associated with a DNS name is stored in so-called resource records (RRs). Every resource record has the
following attributes:
• The owner is the domain name which identifies a resource record.
• The type indicates the kind of information that is stored in a resource record. The most important types are defined
in RFC 1034 [54], RFC 1886 [60], RFC 2535 [56], and RFC 3596 [61] and are summarized in Table 5.1.
• The class indicates the protocol specific name space. The DNS protocol was originally designed to support other
namespaces in addition to the Internet name space. The predominant class today is IN (Internet) while some
older installations also supported the class CH (Chaos System).
• The time to live (TTL) defines how long (counted in seconds) information from a resource record can be stored in
a local cache.
• The data format (RDATA) of a resource record depends on the type of the resource record.
5.1.3 DNS Message Formats
All DNS messages have the same structure. A DNS message has five parts:
1. A DNS message starts with a protocol header. It indicates which of the following four parts are present and
whether the message is a query or a response.
2. The header is followed by a list of questions.
3. The list of questions is followed by a list of answers (resource records).
4. The list of answers is followed by a list of pointers to authorities (also in the form of resource records).
5. The list of pointers to authorities is followed by a list of additional information (also in the form of resource
records). This list may contain for example A resource records for names in a response to an MX query.
The DNS protocol normally runs over UDP for simple queries. For larger data transfers (e.g. zone transfers), DNS may
utilize TCP. Both protocols use the well-known port number 53.
Message Header
All DNS messages start with the common header which has the following format:
  0                             1
  0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      ID                       |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|QR|   Opcode  |AA|TC|RD|RA| Z|AD|CD|   RCODE   |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    QDCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ANCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    NSCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ARCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
• The ID field contains a number which allows incoming responses to be correlated with outstanding requests.
• The QR bit is 0 in a query message and 1 in a response message.
• The field OPCODE indicates the query type. This field is 0 for a standard query (QUERY) and 1 for an inverse
query (IQUERY).
• The AA bit is set if the response is authoritative (authoritative answer).
• The TC bit indicates that the message is truncated due to restrictions of the transport system (truncated).
• The RD bit is set on queries to start a recursive query (recursion desired).
• The RA bit is set on responses and denotes whether recursive query support is available (recursion available).
• The Z bit is unused.
• The AD bit indicates that all data in the response has been cryptographically verified or otherwise meets the DNS
server’s local security policy.
• The CD bit is set when a resolver is willing to accept non-authenticated data (checking disabled).
• The RCODE field contains an error code which is only significant in response messages.
• The QDCOUNT field contains the number of queries in the query list.
• The ANCOUNT field contains the number of responses in the answer list.
• The NSCOUNT field contains the number of authoritative name servers in the list of authoritative name servers.
• The ARCOUNT field contains the number of elements in the list of additional information.
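The header consists of six 16-bit words, where the second word packs all flag bits. The following sketch encodes and decodes the header according to the bit positions in the diagram above (bit 0 is the most significant bit); the function names are chosen for this example.

```python
import struct

DNS_HEADER = struct.Struct("!HHHHHH")   # ID, flags, and the four counters


def pack_header(msg_id, qr=0, opcode=0, aa=0, tc=0, rd=0, ra=0, ad=0, cd=0,
                rcode=0, qdcount=0, ancount=0, nscount=0, arcount=0) -> bytes:
    """Build the 12-byte DNS message header."""
    flags = ((qr << 15) | (opcode << 11) | (aa << 10) | (tc << 9) | (rd << 8)
             | (ra << 7) | (ad << 5) | (cd << 4) | rcode)   # bit 9 (Z) stays 0
    return DNS_HEADER.pack(msg_id, flags, qdcount, ancount, nscount, arcount)


def unpack_header(data: bytes) -> dict:
    """Decode the header at the start of a DNS message."""
    msg_id, flags, qd, an, ns, ar = DNS_HEADER.unpack_from(data)
    return {
        "id": msg_id,
        "qr": flags >> 15,
        "opcode": (flags >> 11) & 0xF,
        "aa": (flags >> 10) & 1,
        "tc": (flags >> 9) & 1,
        "rd": (flags >> 8) & 1,
        "ra": (flags >> 7) & 1,
        "ad": (flags >> 5) & 1,
        "cd": (flags >> 4) & 1,
        "rcode": flags & 0xF,
        "qdcount": qd, "ancount": an, "nscount": ns, "arcount": ar,
    }


# Header of a standard query with one question and recursion desired.
header = pack_header(0x1234, rd=1, qdcount=1)
print(header.hex())
```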
Query Format
The elements in the list of queries have the following format:
  0                             1
  0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                                               |
/                     QNAME                     /
/                                               /
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                     QTYPE                     |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                     QCLASS                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
• The QNAME field contains the domain name which is being queried.
• The 16-bit field QTYPE determines the type of the query. Some defined values are 1 (A), 2 (NS), 5 (CNAME), 6
(SOA), 12 (PTR), 13 (HINFO), 15 (MX), 24 (SIG), 25 (KEY), and 28 (AAAA).
• The 16-bit field QCLASS determines the class. This field usually contains the value 1 for the Internet (IN).
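An entry of the question list can be built with a few lines of code. The encoding of QNAME as a sequence of length-prefixed labels terminated by a zero byte follows RFC 1035 and is not spelled out in the text above; the function names are chosen for this example.

```python
import struct


def encode_qname(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending with a zero byte."""
    out = bytearray()
    for label in name.rstrip(".").split("."):
        if not label:
            continue                 # the root name has no labels
        raw = label.encode("ascii")
        out.append(len(raw))         # one length byte per label
        out += raw
    out.append(0)                    # zero-length label marks the root
    return bytes(out)


def build_question(name: str, qtype: int = 1, qclass: int = 1) -> bytes:
    """Build one question entry: QNAME followed by QTYPE and QCLASS."""
    return encode_qname(name) + struct.pack("!HH", qtype, qclass)


# Question for the AAAA record (type 28) of www.example.com in class IN.
print(build_question("www.example.com", qtype=28))
```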
Response Format
The list of answers, the list of authoritative servers and the list with additional information all have the same structure:
  0                             1
  0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                                               |
/                                               /
/                      NAME                     /
|                                               |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      TYPE                     |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                     CLASS                     |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      TTL                      |
|                                               |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                   RDLENGTH                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--|
/                     RDATA                     /
/                                               /
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
• The NAME field contains the domain name associated with the following resource record.
• The 16-bit TYPE field indicates the type of the resource record and determines the format of the RDATA field.
• The 16-bit CLASS field indicates the class. This field usually contains the value 1 for the Internet (IN).
• The 32-bit TTL field contains the lifetime for the following resource record in seconds.
• The 16-bit RDLENGTH field contains the length of the following RDATA field in bytes.
• The RDATA field contains the actual data of the resource record. The format of the RDATA field depends
on the resource record type indicated by the TYPE field.
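The fixed part of a resource record that follows the NAME field can be decoded in one step. In the sketch below, TTL is read as a 32-bit value, since it occupies two 16-bit rows in the diagram. The function assumes that the caller has already skipped the NAME field; decoding names is left out, and the function name is chosen for this example.

```python
import struct

RR_FIXED = struct.Struct("!HHIH")   # TYPE, CLASS, TTL (32 bit), RDLENGTH


def parse_rr_after_name(message: bytes, offset: int):
    """Decode TYPE, CLASS, TTL, RDLENGTH and RDATA starting at offset.

    Returns the decoded fields and the offset of the next record.
    """
    rtype, rclass, ttl, rdlength = RR_FIXED.unpack_from(message, offset)
    start = offset + RR_FIXED.size
    rdata = message[start:start + rdlength]
    fields = {"type": rtype, "class": rclass, "ttl": ttl, "rdata": rdata}
    return fields, start + rdlength


# An A record (type 1, class IN) with a TTL of one hour and a 4-byte address.
raw = RR_FIXED.pack(1, 1, 3600, 4) + bytes([192, 0, 2, 1])
print(parse_rr_after_name(raw, 0))
```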
Resource Record Formats
The various resource records have the following formats:
• An A resource record contains an IPv4 address encoded in 4 bytes in network byte order.
• An AAAA resource record contains an IPv6 address encoded in 16 bytes in network byte order.
• A CNAME resource record contains a character string preceded by the length of the string which is encoded in the
first byte (and thus restricts the string to 255 characters).
• A HINFO resource record contains two character strings, each prefixed with a length byte. The first character
string describes the CPU and the second string the operating system.
• An MX resource record contains a 16-bit preference number (used to prioritize multiple entries) followed by a
character string prefixed with a length byte. The character string contains the DNS name of a mail exchanger.
• A NS resource record contains a character string prefixed by a length byte which contains the name of an authoritative DNS server.
• A PTR resource record contains a character string prefixed with a length byte which contains a domain name
(a pointer to another part of the name space). PTR records are used to map IP addresses to names (so-called reverse lookups). For an IPv4
address of the form d1.d2.d3.d4, a PTR resource record is created for the pseudo domain name d4.d3.d2.d1.in-addr.arpa.
For an IPv6 address with the 32 hexadecimal digits h1 h2 h3 h4 : ... : h29 h30 h31 h32, a PTR resource record is created
for the pseudo domain name h32.h31.h30.h29. ... .h4.h3.h2.h1.ip6.arpa.
• A SOA resource record contains two character strings, each prefixed by a length byte, and five 32-bit numbers.
The first character string contains the name of the DNS server responsible for a zone. The second character string
contains the mail address of the administrator responsible for the management of the zone. The first unsigned 32-bit number contains a serial number (SERIAL) which must be incremented by the zone administrator whenever the
zone database is changed. The second 32-bit number defines the time which may elapse before cached
zone information must be updated (REFRESH). The third 32-bit number defines the time to wait before a failed
refresh is retried (RETRY). The fourth 32-bit number defines a time interval after which zone
information is considered not current anymore (EXPIRE). The fifth 32-bit number contains the minimum lifetime
for resource records (MINIMUM).
• The KEY and SIG resource records have a rather complex format which is described in detail in RFC 2535 [56].
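The pseudo domain names used for reverse lookups can be derived mechanically from an address. The sketch below relies on the reverse_pointer attribute of Python's ipaddress module; the function name and the example addresses (taken from the documentation address ranges) are chosen for illustration.

```python
import ipaddress


def reverse_name(address: str) -> str:
    """Return the in-addr.arpa or ip6.arpa name for an IP address."""
    return ipaddress.ip_address(address).reverse_pointer


print(reverse_name("192.0.2.1"))      # IPv4: the four bytes in reverse order
print(reverse_name("2001:db8::1"))    # IPv6: 32 hexadecimal digits in reverse order
```

For the IPv6 address, the abbreviated notation is first expanded to all 32 hexadecimal digits, so the resulting name has 32 single-digit labels in front of ip6.arpa.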
5.2 Abstract Syntax Notation One (ASN.1)
The Abstract Syntax Notation One (ASN.1) [62, 63, 64] is a language for the definition of data structures and message
formats which was developed in the 1980s. ASN.1 has been standardized by the ITU. Note that current versions of
ASN.1 differ significantly from earlier versions of ASN.1.
Figure 5.3: Abstract syntax, local syntax and transfer syntax
(Diagram not reproduced. Its labels are: Abstract Syntax (ASN.1); System A and System B, each with a Local Syntax and a De/Encoder; Data Exchange between the two systems; Transfer Syntax (ASN.1 Encoding Rules).)
ASN.1 was primarily developed to formally describe data structures that are exchanged between applications in a distributed system. The idea was to let developers focus on the definition of data structures and to give developers tools to
generate the necessary encoding / decoding functions. Some more specific requirements for ASN.1 were:
• Exchange of information between machines with different hardware architectures.
• Independence of existing programming languages (neutrality).
• Sender and receiver should be able to choose one out of multiple data encoding formats that suits their needs.
The fundamental principle behind ASN.1 is the separation of the data representation during transmission from the data
representation within applications that might be written in different programming languages (see Figure 5.3).
• The abstract syntax defines data structures in an implementation-neutral format. The abstract syntax
is mapped to a local syntax which is used in a concrete implementation. The local syntax is specific to the
programming language being used to develop the application and might be different for different implementations.
ASN.1 compilers can be used to generate the local syntax from the abstract syntax.
• The transfer syntax defines how data structures are serialized for transmission over a network. A concrete transfer
syntax is commonly defined as a set of encoding rules which define how values of the various ASN.1 types are
encoded. There are multiple encoding rules for ASN.1 and it is therefore necessary that applications agree on the
transfer syntax to use. The implementation of the encoding and decoding functions can be automated by using
ASN.1 compilers.
5.2.1 Basic Concepts
ASN.1 definitions basically consist of type definitions and value definitions that are organized into ASN.1 modules.
• Names of ASN.1 data types always begin with an uppercase character.
• Names of ASN.1 values (constants) always begin with a lowercase character.
• ASN.1 keywords and macro names only contain uppercase characters.
• Comments begin with two hyphens (--) and they end either at the end of the line or at the next occurrence of two
hyphens (--).
5.2.2 ISO Registration Tree
It is often necessary to uniquely identify artefacts such as specific protocol definitions or parameters. The ISO registration
tree provides a hierarchical name system that can be used for this purpose. Note that the hierarchical structure makes
it possible to delegate authority. For example, the US Department of Defense (dod) has delegated authority over the
internet subtree to the Internet Assigned Numbers Authority (IANA).
root
  itu-t(0) / ccitt(0)
  iso(1)
    standard(0)
    registration-authority(1)
    member-body(2)
    identified-organization(3)
      dod(6)
        internet(1)
          directory(1)
          mgmt(2)
            mib-2(1)
              system(1)
              interfaces(2)
              ip(4)
              icmp(5)
              tcp(6)
              udp(7)
              transmission(10)
                x25(5)
                dot3(7)
                dot5(9)
                fddi(15)
                lapb(16)
              snmp(11)
              ...
          experimental(3)
          private(4)
          security(5)
          snmpV2(6)
            snmpDomains(1)
            snmpProxy(2)
            snmpModules(3)
              snmpMIB(1)
              snmpFrameworkMIB(10)
              ...
  joint-iso-itu-t(2) / joint-iso-ccitt(2)
Figure 5.4: ISO registration tree
Note that nodes are uniquely identified by the assigned numbers and not necessarily by their associated descriptors.
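To illustrate how a path through the registration tree turns into a sequence of numbers, the following Python sketch models a small excerpt of the tree as nested dictionaries. The data structure and the function name resolve are assumptions made for this example only; they are not part of any standard.

```python
# Excerpt of the registration tree: descriptor -> (assigned number, children).
TREE = {
    "iso": (1, {
        "identified-organization": (3, {
            "dod": (6, {
                "internet": (1, {
                    "mgmt": (2, {
                        "mib-2": (1, {
                            "system": (1, {}),
                            "interfaces": (2, {}),
                        }),
                    }),
                    "private": (4, {}),
                }),
            }),
        }),
    }),
}


def resolve(path):
    """Translate a list of descriptors into the dotted number notation."""
    numbers = []
    level = TREE
    for descriptor in path:
        number, level = level[descriptor]
        numbers.append(number)
    return ".".join(str(n) for n in numbers)


print(resolve(["iso", "identified-organization", "dod", "internet",
               "mgmt", "mib-2", "system"]))
```

Only the numbers collected along the path identify the node; the descriptors merely serve as keys for the lookup.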
...
5.2.3 Primitive ASN.1 Data Types
• The data type BOOLEAN represents the two logical values TRUE and FALSE.
• The data type INTEGER represents integral numbers. Note that there is no restriction on the precision. The
INTEGER type can also be used to represent named numbers.
• The data type BIT STRING represents a sequence of defined bits. The length of a BIT STRING does not have
to be a multiple of 8.
• The data type OCTET STRING represents a sequence of octets (bytes). The OCTET STRING is a base type for
character strings using different character sets or some other formatted strings such as GeneralizedTime or
UTCTime.
• The data type ENUMERATED represents an enumeration of values. The set of possible values must be defined
when deriving a type from the ENUMERATED primitive type.
• The data type OBJECT IDENTIFIER identifies a node in the ISO registration tree. An OBJECT IDENTIFIER
value determines the path from the root of the registration tree to a target node.
• The data type ObjectDescriptor contains a character string identifying a node in the ISO registration tree.
Note that the value of an ObjectDescriptor is not guaranteed to uniquely identify a node in the ISO registration tree.
• The data type ANY represents any valid ASN.1 data type. It is basically the union of all ASN.1 data types.
• The data type EXTERNAL represents data structures which have not been defined with ASN.1. This type
is useful to incorporate non-ASN.1 elements in ASN.1 messages.
• The data type NULL is a place holder, typically used to indicate that a value is missing in a constructed type.
5.2.4 Constructed ASN.1 Data Types
• The data type SEQUENCE corresponds to structs in C or records in Modula. The order of the elements of a
SEQUENCE is well defined.
• The data type SET is similar to a SEQUENCE. However, the elements of a SET are not ordered.
• The data type SEQUENCE OF represents an ordered set (list) of values (which have the same type).
• The data type SET OF represents an unordered set (list) of values (which have the same type).
• The data type CHOICE is a selection type and corresponds to unions in C or variant records in Modula.
• The data type REAL is a constructed type representing floating point numbers. The mantissa and the exponent
are INTEGER values. The REAL type therefore also has unlimited precision.
5.2.5 Restrictions on ASN.1 Data Types
ASN.1 data types can be restricted to reduce the number of possible values. The precise syntax for the restrictions
depends on the data type. Typical restrictions are size restrictions and value range restrictions. Restrictions are inherited
by derived types. It is usually a good idea to introduce restricted INTEGER types that match the precision supported by
typical processor hardware. Some example type restrictions are shown below:
Unsigned32  ::= INTEGER (0..4294967295)
Integer32   ::= INTEGER (-2147483648..2147483647)
Unsigned64  ::= INTEGER (0..18446744073709551615)
MacAddress  ::= OCTET STRING (SIZE (6))
InetAddress ::= OCTET STRING (SIZE (4|16))
5.2.6 ASN.1 Tags
Data types have associated tags which can be used to identify the type of a value during transmission. A tag consists of
a tag number and a tag class. There are four different tag classes:
1. UNIVERSAL: Globally unique identification (tag assignments restricted to ASN.1 standards).
2. APPLICATION: Unique identification within an application.
3. PRIVATE: Private and not universally used identification.
4. CONTEXT-SPECIFIC: Unique identification within a certain context (for example within a CHOICE).
Data type                 Tag (decimal)
BOOLEAN                   1
INTEGER                   2
BIT STRING                3
OCTET STRING              4
NULL                      5
OBJECT IDENTIFIER         6
ObjectDescriptor          7
EXTERNAL                  8
REAL                      9
ENUMERATED                10
...                       ...
SEQUENCE, SEQUENCE OF     16
SET, SET OF               17
NumericString             18
PrintableString           19
TeletexString             20
VideotexString            21
IA5String                 22
UTCTime                   23
GeneralizedTime           24
GraphicString             25
VisibleString             26
GeneralString             27
UniversalString           28
...                       ...
Table 5.2: Universal tag numbers for the core ASN.1 data types
Table 5.2 summarizes the universal tags for the fundamental ASN.1 data types which are assigned in the ASN.1
standard.
5.2.7 Example ASN.1 Definition
The following example shows an ASN.1 module for a definition of the Simple Network Management Protocol version 1.
The definitions are semantically identical to RFC 1155 [65] and RFC 1157 [66] but differ in some details of the notation.
SNMP-VERSION-1 DEFINITIONS ::= BEGIN

-- object names and types

ObjectName ::= OBJECT IDENTIFIER

ObjectSyntax ::= CHOICE {
    simple            SimpleSyntax,
    application-wide  ApplicationSyntax
}

SimpleSyntax ::= CHOICE {
    integer-value  INTEGER (-2147483648..2147483647),
    string-value   OCTET STRING (SIZE (0..65535)),
    oid-value      OBJECT IDENTIFIER,
    empty          NULL
}

ApplicationSyntax ::= CHOICE {
    address-value    NetworkAddress,
    counter-value    Counter32,
    gauge-value      Gauge32,
    timeticks-value  TimeTicks,
    arbitrary-value  Opaque
}

NetworkAddress ::= CHOICE {
    internet  IpAddress
}

IpAddress ::= [APPLICATION 0] IMPLICIT OCTET STRING (SIZE (4))
Counter32 ::= [APPLICATION 1] IMPLICIT INTEGER (0..4294967295)
Gauge32   ::= [APPLICATION 2] IMPLICIT INTEGER (0..4294967295)
TimeTicks ::= [APPLICATION 3] IMPLICIT INTEGER (0..4294967295)
Opaque    ::= [APPLICATION 4] IMPLICIT OCTET STRING

-- protocol data units

Message ::= SEQUENCE {
    version    INTEGER { version-1(0) },
    community  OCTET STRING,
    data       ANY    -- PDUs if trivial authentication
}

PDUs ::= CHOICE {
    get-request       GetRequest-PDU,
    get-next-request  GetNextRequest-PDU,
    get-response      GetResponse-PDU,
    set-request       SetRequest-PDU,
    trap              Trap-PDU
}

GetRequest-PDU     ::= [0] IMPLICIT PDU
GetNextRequest-PDU ::= [1] IMPLICIT PDU
GetResponse-PDU    ::= [2] IMPLICIT PDU
SetRequest-PDU     ::= [3] IMPLICIT PDU
Trap-PDU           ::= [4] IMPLICIT TrapPDU

max-bindings INTEGER ::= 2147483647

RequestID   ::= INTEGER (-2147483648..2147483647)
ErrorIndex  ::= INTEGER (0..max-bindings)
ErrorStatus ::= INTEGER { noError(0),
                          tooBig(1),
                          noSuchName(2),
                          badValue(3),
                          readOnly(4),
                          genErr(5) }

VarBind ::= SEQUENCE {
    name   ObjectName,
    value  ObjectSyntax
}

VarBindList ::= SEQUENCE (SIZE (0..max-bindings)) OF VarBind

PDU ::= SEQUENCE {
    request-id         RequestID,
    error-status       ErrorStatus,
    error-index        ErrorIndex,
    variable-bindings  VarBindList
}

TrapPDU ::= SEQUENCE {
    enterprise         OBJECT IDENTIFIER,
    agent-addr         NetworkAddress,
    generic-trap       INTEGER { coldStart(0),
                                 warmStart(1),
                                 linkDown(2),
                                 linkUp(3),
                                 authenticationFailure(4),
                                 egpNeighborLoss(5),
                                 enterpriseSpecific(6) },
    specific-trap      INTEGER (0..2147483647),
    time-stamp         TimeTicks,
    variable-bindings  VarBindList
}

END
5.2.8 Basic Encoding Rules (BER)
The Basic Encoding Rules (BER) use a tag/length/value (TLV) encoding. Every data element is encoded as a tag
value, the length of the data element in bytes and the data element itself. The TLV encoding enables a receiver to
identify the type of every data element which can then be verified against the type the receiver expects. The price for
this is a somewhat increased length of the encoded messages and it is ultimately an engineering decision whether the type
information justifies the price. There are other ASN.1 encodings like the Packed Encoding Rules (PER) which do not
generally encode type information.
Encoding of Tags
Short form (tag number < 31), one byte:
  bits 8-7: tag class { universal(00), application(01), context(10), private(11) }
  bit 6:    tag type { primitive(0), constructed(1) }
  bits 5-1: tag number
Long form (tag number >= 31), multiple bytes:
  first byte:      bits 8-7 tag class, bit 6 tag type, bits 5-1 all set to 1
  following bytes: bit 8 is 1 in every byte except the last one, where it is 0;
                   bits 7-1 of these bytes together carry the tag number
Figure 5.5: BER encoding of tags
• If the tag number is less than 31, the tag number can be encoded in a single byte. This usually covers most of
the cases. If the tag number is 31 or larger, then multiple bytes must be used to encode the tag number. The highest bit
of each of these bytes indicates whether more bytes follow and hence only seven bits are used to actually encode the number.
• The primitive / constructed bit can be used by the receiver to determine whether the value itself is a BER
encoding, which is the case for constructed types. This allows the decoder to simply recursively call the BER
decoder whenever a constructed tag has been received.
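The tag encoding described above can be sketched in Python as follows. The function name encode_tag and the dictionary CLASS_BITS are hypothetical names chosen for this example; the sketch covers both the single byte form and the multi-byte form.

```python
CLASS_BITS = {"universal": 0b00, "application": 0b01,
              "context": 0b10, "private": 0b11}


def encode_tag(tag_class, constructed, number):
    """Return the BER tag bytes for the given class, type bit and number."""
    if number < 0:
        raise ValueError("tag numbers are non-negative")
    first = (CLASS_BITS[tag_class] << 6) | (0x20 if constructed else 0x00)
    if number < 31:
        return bytes([first | number])
    # Multi-byte form: split the number into groups of seven bits.
    groups = []
    while True:
        groups.append(number & 0x7F)
        number >>= 7
        if number == 0:
            break
    groups.reverse()
    # The highest bit is set in every byte except the last one.
    tail = [g | 0x80 for g in groups[:-1]] + [groups[-1]]
    return bytes([first | 0x1F] + tail)


print(encode_tag("universal", True, 16).hex())  # SEQUENCE
print(encode_tag("context", True, 1).hex())     # [1] on a constructed type
```

The two calls correspond to the SEQUENCE tag and to the context-specific tag [1] used for GetNextRequest-PDU in the SNMP example.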
Encoding of Lengths
Short form (length < 128), one byte:
  bit 8:    0
  bits 7-1: length
Long form (length > 127), multiple bytes:
  first byte:      bit 8 is 1; bits 7-1 contain the number of bytes encoding the length
  following bytes: length
Figure 5.6: BER encoding of the value length
• A length less than 128 bytes can be encoded in a single length byte. Larger length values must be encoded in
multiple bytes where the first byte indicates how many bytes are used to encode the length value.
• Since the first byte can announce at most 126 length bytes (the value 127 is reserved), the maximum length
is $2^{126 \cdot 8} - 1 = 2^{1008} - 1$ bytes.
• Besides the definite form length encoding shown here, there is also an indefinite form where the end of a value is
identified by a certain marker.
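A minimal Python sketch of the definite form length encoding is shown below. The function names are hypothetical. The sketch rejects lengths that would need more than 126 length bytes, since the first byte value 0xff is reserved in the ASN.1 standard, and it does not handle the indefinite form.

```python
def encode_length(length):
    """Encode a length using the definite form."""
    if length < 0:
        raise ValueError("length must be non-negative")
    if length < 128:
        return bytes([length])
    body = length.to_bytes((length.bit_length() + 7) // 8, "big")
    if len(body) > 126:
        raise ValueError("length too large for the definite form")
    return bytes([0x80 | len(body)]) + body


def decode_length(data, offset=0):
    """Return (length, offset of the first value byte)."""
    first = data[offset]
    if first < 0x80:
        return first, offset + 1
    count = first & 0x7F
    if count == 0:
        raise ValueError("indefinite form is not handled in this sketch")
    value = int.from_bytes(data[offset + 1:offset + 1 + count], "big")
    return value, offset + 1 + count


print(encode_length(27).hex())
print(encode_length(300).hex())
```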
Encoding of BOOLEAN Values
• A BOOLEAN value is encoded by a single byte. The value TRUE is represented by the byte 0xff and the value
FALSE is represented by the byte 0x00.
Encoding of INTEGER Values
• An INTEGER value is encoded as a binary number. Negative numbers are encoded in two’s complement notation.
• Note that it might be necessary to encode a leading 0x00 byte for unsigned values. For example, given the
following ASN.1 type definition
Unsigned8 ::= INTEGER (0..255)
the decimal value 255 will be encoded using two bytes as 0x00ff.
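The following Python sketch shows why the leading 0x00 byte appears. It searches for the shortest two's complement representation of a value; the function names are hypothetical, and encode_integer assumes that the content is shorter than 128 bytes so that a single length byte suffices.

```python
def encode_integer_content(value):
    """Shortest two's complement encoding of an integer, big-endian."""
    length = 1
    while True:
        try:
            return value.to_bytes(length, "big", signed=True)
        except OverflowError:
            length += 1


def encode_integer(value):
    """Complete tag/length/value encoding (content below 128 bytes assumed)."""
    content = encode_integer_content(value)
    return bytes([0x02, len(content)]) + content


print(encode_integer_content(255).hex())
print(encode_integer_content(-1).hex())
print(encode_integer(0).hex())
```

The value 255 needs two bytes because the single byte 0xff would be read as -1 by a decoder.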
Encoding of BIT STRING Values
first byte:      number of unused bits in the last byte
following bytes: the bits b0 b1 b2 b3 b4 b5 b6 b7 ..., where b0 is stored in bit 8 of the first of these bytes
Figure 5.7: BER encoding of BIT STRING values
• A BIT STRING value is encoded in a sequence of bytes that contain the named bits. The byte sequence is
prefixed with a byte which indicates the number of unused bits in the last byte of the byte sequence.
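As an illustration, the content bytes of a BIT STRING can be produced as follows. The sketch takes the bits as a Python string of the characters 0 and 1 (an input format chosen for this example only), with b0 being the first character.

```python
def encode_bit_string_content(bits):
    """Encode a string such as "101" into BIT STRING content bytes."""
    if set(bits) - {"0", "1"}:
        raise ValueError("only the characters 0 and 1 are allowed")
    unused = (8 - len(bits) % 8) % 8
    padded = bits + "0" * unused  # unused bits are set to 0 in this sketch
    body = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return bytes([unused]) + body


print(encode_bit_string_content("101").hex())
```

For the three bits 101, five bits of the last byte remain unused, so the content starts with the byte 0x05.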
Encoding of OCTET STRING Values
• An OCTET STRING value is encoded as a sequence of bytes.
Encoding of NULL Values
• A NULL value is encoded using zero bytes. A NULL value is thus encoded by its tag and the length 0.
Encoding of OBJECT IDENTIFIER Values
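sub-identifier < 128, one byte:
  bit 8:    0
  bits 7-1: sub-identifier
sub-identifier > 127, multiple bytes:
  bit 8 is 1 in every byte except the last one, where it is 0;
  bits 7-1 of these bytes together carry the sub-identifier
Figure 5.8: BER encoding of OBJECT IDENTIFIER values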
• An OBJECT IDENTIFIER value is encoded as a sequence of sub-identifiers.
• The first two sub-identifiers X and Y of an OBJECT IDENTIFIER value are encoded together in a single
sub-identifier using the formula (X · 40) + Y. (This works since X can only hold one of the three values 0, 1 or 2.)
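The two rules above can be combined into a short Python sketch which produces the content bytes of an OBJECT IDENTIFIER value. The function names are hypothetical. The check that Y stays below 40 when X is 0 or 1 is an additional restriction of the ASN.1 standard that is not spelled out in the text above.

```python
def encode_subidentifier(value):
    """Encode one sub-identifier in groups of seven bits."""
    groups = [value & 0x7F]            # last byte: highest bit is 0
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(groups))


def encode_oid_content(dotted):
    """Encode an OID given in dotted notation, e.g. "1.3.6.1.2.1.1"."""
    arcs = [int(part) for part in dotted.split(".")]
    if len(arcs) < 2 or any(a < 0 for a in arcs):
        raise ValueError("at least two non-negative sub-identifiers needed")
    if arcs[0] > 2 or (arcs[0] < 2 and arcs[1] > 39):
        raise ValueError("invalid first or second sub-identifier")
    subids = [arcs[0] * 40 + arcs[1]] + arcs[2:]
    return b"".join(encode_subidentifier(s) for s in subids)


print(encode_oid_content("1.3.6.1.2.1.1").hex())
```

The prefix 1.3 is merged into the single byte 0x2b (43 = 1 · 40 + 3), which is why BER encodings of OIDs below the internet subtree start with 2b:06:01.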
Encoding of SEQUENCE / SEQUENCE OF / SET / SET OF Values
• Values of constructed types are encoded by encoding the elements contained in a constructed type.
Encoding of CHOICE Values
• CHOICE values are encoded by encoding the value currently present in the CHOICE. This requires that the choice
can be identified by its tag value.
Example BER Encoding
The following example shows the BER encoding of an SNMP message which is consistent with the ASN.1 definition
shown in Section 5.2.7.
Bytes                        Tag                       Length   Value
30:1b                        SEQUENCE {                27
02:01:00                       INTEGER                 1        0
04:06:70:75:62:6C:69:63        OCTET STRING            6        "public"
a1:0e                          GetNextRequest-PDU {    14
02:04:36:a2:8f:07                INTEGER               4        916623111
02:01:00                         INTEGER               1        0
02:01:00                         INTEGER               1        0
30:00                            SEQUENCE OF {}        0
                               }
                             }
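The example message can be taken apart with a few lines of Python. The decoder below is only a sketch under simplifying assumptions: it handles single byte tags and single byte lengths, which is sufficient for this message, and it interprets only INTEGER values. The names MESSAGE and decode are chosen for this example.

```python
MESSAGE = bytes.fromhex(
    "301b"                 # SEQUENCE, length 27
    "020100"               # INTEGER 0 (version)
    "04067075626c6963"     # OCTET STRING "public"
    "a10e"                 # GetNextRequest-PDU, length 14
    "020436a28f07"         # INTEGER (request-id)
    "020100"               # INTEGER 0 (error-status)
    "020100"               # INTEGER 0 (error-index)
    "3000"                 # empty SEQUENCE OF
)


def decode(data, depth=0):
    """Return a list of (depth, tag byte, length, value) tuples."""
    items = []
    offset = 0
    while offset < len(data):
        tag = data[offset]
        length = data[offset + 1]
        if (tag & 0x1F) == 0x1F or length >= 0x80:
            raise ValueError("multi-byte tags or lengths are not handled")
        content = data[offset + 2:offset + 2 + length]
        if tag & 0x20:
            # Constructed type: the content is again a BER encoding.
            items.append((depth, tag, length, None))
            items.extend(decode(content, depth + 1))
        elif tag == 0x02:
            value = int.from_bytes(content, "big", signed=True)
            items.append((depth, tag, length, value))
        else:
            items.append((depth, tag, length, content))
        offset += 2 + length
    return items


for depth, tag, length, value in decode(MESSAGE):
    print("  " * depth + f"tag={tag:02x} length={length} value={value!r}")
```

The recursion on the constructed bit mirrors the remark in the discussion of the tag encoding: whenever a constructed tag is seen, the decoder simply calls itself on the content.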
Remarks
• With definite length encoding, it is required that the length of the BER encoding of a value is known before the
length field can be encoded. The alternative would be to reserve some space for the length and to move encoded
values around if the reserved space was insufficient (or too big). This approach however is rather costly.
• An alternative approach is to create the BER encoding from the innermost ASN.1 element to the outermost ASN.1
element and to construct the BER encoding from the end to the beginning. This technique works fine in some
special cases but not in the general case.
• During the processing of messages, it is often required to change some fields in a message before it is passed
on. This is difficult to achieve in the general case, because even a small change can cause massive changes in the BER
encoding, for example when length fields have to be increased.
5.2.9 Generic String Encoding Rules (GSER)
The Generic String Encoding Rules (GSER) defined in RFC 3641 [67] are a set of encoding rules that produce a
human readable UTF-8 character string encoding of an ASN.1 value of any given ASN.1 type. The encoding
does not use tags. Instead, values are prefixed with an identifier which is taken from the ASN.1 definition. The values
contained in constructed types are usually enclosed in curly braces.
5.3 Simple Network Management Protocol (SNMP)
The Simple Network Management Protocol (SNMP), developed in the late 1980s, was designed to provide standardized
access to management and control information on network devices. The introduction of SNMP was motivated by the
need to monitor, control and configure the evolving Internet.
5.3.1 Foundations
It is necessary to first introduce some fundamental concepts and the associated terminology that is being used frequently
in the network management community.
Functional Areas
The services provided by a network management system can be grouped into five categories:
1. Fault Management
Covers fault detection, fault isolation and fault correction.
2. Configuration Management
Covers the creation and maintenance of configuration information, name management, and the starting,
control and termination of services.
3. Accounting Management
Covers the collection of usage data, the distribution and monitoring of quotas, and the maintenance of
usage statistics.
4. Performance Management
Covers the collection of statistical data, the determination of system performance, and any changes made
to optimize performance.
5. Security Management
Covers the creation and control of security services, key generation and distribution, and
the reporting and analysis of security-relevant events.
5.4 Augmented Backus-Naur Form (ABNF)
Many application layer protocols (especially those running over a byte-stream oriented transport) in the Internet protocol
suite use a textual format for the protocol messages instead of binary encodings. Many of these protocols define a set
of textual commands that can be sent to a server, which responds with textual replies that typically include a structured
response code in a textual decimal format.
In order to reduce ambiguities when defining such textual protocols, an augmented version of the Backus-Naur Form
(BNF) has been developed. The Augmented Backus-Naur Form (ABNF) defined in RFC 2234 [68] differs from standard
BNF in several aspects such as naming rules, repetition, alternatives, order-independence, and value ranges.
5.4.1 Rule Names, Comments and Terminal Symbols
An ABNF definition consists of a set of rules (sometimes also called productions) like any other BNF. Every rule has a
name followed by an assignment operator followed by an expression consisting of terminal symbols and operators.
name = expression
The end of a rule is marked by the end of the line or by a comment. Comments start with the comment symbol ;
(semicolon) and continue to the end of the line.
• The name of a rule must start with an alphabetic character followed by a combination of alphabetics, digits and
hyphens. The case of a rule name is not significant.
• Terminal symbols are non-negative numbers, introduced by a percent sign. The base of these numbers can be
binary (b), decimal (d) or hexadecimal (x). Multiple values can be concatenated by using the dot . as a value
concatenation operator. It is also possible to define ranges of consecutive values by using the hyphen - as a value
range operator.
• Terminal symbols can also be defined by using literal text strings containing US ASCII characters enclosed in
double quotes. Note that these literal text strings are case-insensitive.
Some examples to demonstrate these core ABNF concepts:
CR    = %d13             ; ASCII carriage return code in decimal
CRLF  = %d13.10          ; ASCII carriage return and linefeed code sequence
DIGIT = %x30-39          ; ASCII digits (0 - 9)
ABA   = "aba"            ; ASCII string "aba" or "ABA" or "Aba" or ...
abba  = %x61.62.62.61    ; ASCII string "abba"
Note that ABNF does not define a module concept and import/export mechanism which could be used to import rules
from other modules into the current module. This reflects the fact that ABNF is mostly used for documentation purposes
rather than code generation purposes.
5.4.2 Operators
The right hand side expressions of ABNF rules use a set of different operators. The ABNF operators are briefly described
in the text below.
Concatenation
The simplest operator is the concatenation operator. The operator symbol is the empty word. (Note that white space
characters are not significant in ABNF.) The rule
abba = %x61 %x62 %x62 %x61    ; ASCII string "abba"
is therefore equivalent to the rule
abba = %x61.62.62.61          ; ASCII string "abba"
and the terminal symbol concatenation feature is thus just a shortcut to make definitions more compact.
Alternatives
Elements separated by the alternatives operator / (forward slash) are alternatives. A rule of the form
mumble = foo / bar
accepts the elements defined either by the rule foo or by the rule bar. Note that the alternatives operator can be used to
emulate the value range operator by spelling out all numbers in the range:
DIGIT = "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"
Since it is often required to specify long lists of alternatives (and since ABNF is line oriented), there is an incremental
version of the alternatives operator which combines the alternatives operator with the assignment symbol:
DIGIT =  "0" / "1" / "2" / "3"
DIGIT =/ "4" / "5" / "6" / "7"
DIGIT =/ "8" / "9"
Grouping
An expression can be grouped using parentheses. A grouped expression is treated as a single element, regardless of the
internal structure of the group's expression. It is recommended to use groups instead of relying on operator precedence
whenever there is a good chance to misread a rule.
Repetitions
Repetitions are specified using the parameterized repetition operator * (star). The full format for the operator is n*m
where n and m are optional decimal values. The value n indicates the minimum number of repetitions (defaults to 0 if
not present) and the value m indicates the maximum number of repetitions (defaults to infinity if not present).
There are two notations for special cases that appear frequently enough to justify the introduction of a special notation:
• The notation nfoo (for example 3foo) indicates that foo appears exactly n times. This is equivalent to n*nfoo.
• Optional elements can be written in square brackets as [foo] which is equivalent to *1foo.
Note that the repeat count is attached directly to the element it applies to. The rule abba can now be written as follows:
abba = %x61 2%x62 %x61
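Readers familiar with regular expressions may find it helpful to see how the repetition operator relates to regular expression quantifiers. The following Python sketch is only an analogy for simple rules; the helper name repeat is hypothetical and general ABNF grammars cannot be translated this way.

```python
import re


def repeat(element_regex, minimum=0, maximum=None):
    """Translate the ABNF repetition n*m into a regular expression."""
    upper = "" if maximum is None else str(maximum)
    return f"(?:{element_regex}){{{minimum},{upper}}}"


DIGIT = r"[\x30-\x39]"                       # DIGIT = %x30-39

# abba = %x61 2%x62 %x61
ABBA = re.compile("\x61" + repeat("\x62", 2, 2) + "\x61")

print(bool(ABBA.fullmatch("abba")))
print(bool(re.fullmatch(repeat(DIGIT, 1, 3), "255")))   # 1*3DIGIT
print(bool(re.fullmatch(repeat(DIGIT, 0, 1), "")))      # [DIGIT]
```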
Operator Precedence
The precedence of the operators from highest (binding tightest) at the top, to lowest and loosest at bottom:
Repetition
Grouping, Optional
Concatenation
Alternative
It is generally recommended that the grouping operator be used to make explicit groups in order to avoid any potential
confusion.
5.4.3 Core Definitions
The following core definitions are taken from the ABNF specification in RFC 2234 [68]. They are frequently used in
other ABNF definitions.
ALPHA  = %x41-5A / %x61-7A    ; A-Z / a-z
BIT    = "0" / "1"
CHAR   = %x01-7F              ; any 7-bit US-ASCII character, excluding NUL
CR     = %x0D                 ; carriage return
CRLF   = CR LF                ; Internet standard newline
CTL    = %x00-1F / %x7F       ; controls
DIGIT  = %x30-39              ; 0-9
DQUOTE = %x22                 ; " (Double Quote)
HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
HTAB   = %x09                 ; horizontal tab
LF     = %x0A                 ; linefeed
LWSP   = *(WSP / CRLF WSP)    ; linear white space (past newline)
OCTET  = %x00-FF              ; 8 bits of data
SP     = %x20                 ; space
VCHAR  = %x21-7E              ; visible (printing) characters
WSP    = SP / HTAB            ; White space
5.4.4 ABNF in ABNF
The syntax of ABNF can be specified in ABNF itself. This is again copied from RFC 2234 [68].
rulelist      = 1*( rule / (*c-wsp c-nl) )
rule          = rulename defined-as elements c-nl
                ; continues if next line starts with white space
rulename      = ALPHA *(ALPHA / DIGIT / "-")
defined-as    = *c-wsp ("=" / "=/") *c-wsp
                ; basic rules definition and incremental alternatives
elements      = alternation *c-wsp
c-wsp         = WSP / (c-nl WSP)
c-nl          = comment / CRLF
                ; comment or newline
comment       = ";" *(WSP / VCHAR) CRLF
alternation   = concatenation
                *(*c-wsp "/" *c-wsp concatenation)
concatenation = repetition *(1*c-wsp repetition)
repetition    = [repeat] element
repeat        = 1*DIGIT / (*DIGIT "*" *DIGIT)
element       = rulename / group / option /
                char-val / num-val / prose-val
group         = "(" *c-wsp alternation *c-wsp ")"
option        = "[" *c-wsp alternation *c-wsp "]"
char-val      = DQUOTE *(%x20-21 / %x23-7E) DQUOTE
                ; quoted string of SP and VCHAR without DQUOTE
num-val       = "%" (bin-val / dec-val / hex-val)
bin-val       = "b" 1*BIT [ 1*("." 1*BIT) / ("-" 1*BIT) ]
                ; series of concatenated bit values
                ; or single ONEOF range
dec-val       = "d" 1*DIGIT [ 1*("." 1*DIGIT) / ("-" 1*DIGIT) ]
hex-val       = "x" 1*HEXDIG [ 1*("." 1*HEXDIG) / ("-" 1*HEXDIG) ]
prose-val     = "<" *(%x20-3D / %x3F-7E) ">"
                ; bracketed string of SP and VCHAR without angles
                ; prose description, to be used as last resort
5.5 Simple Mail Transfer Protocol (SMTP)
A very fundamental service in the Internet is electronic mail (email). The Simple Mail Transfer Protocol (SMTP) originally specified in RFC 821 [69] and currently documented in RFC 2821 [70] is the primary protocol for mail transport
and delivery in the Internet. SMTP usually runs over TCP and uses the well-known port number 25.
5.5.1 Foundations
Electronic mail is likewise forwarded according to the store-and-forward principle, in which one node at a time holds
the current copy of a message and takes responsibility for forwarding it. When a message is forwarded, a new copy is
first created on the next node. Once this step has completed successfully, the local copy can be
destroyed, since the next node is now responsible for the delivery.
Figure 5.9 shows a typical configuration for sending, forwarding and reading electronic mail in the Internet.
[Figure: sender host systems with user agents, local MTAs and mail queues; relay MTAs with mail queues in
organization A and organization B, connected via SMTP; receiver host systems with local MTAs, user mailboxes and
user agents, some of which access the mailbox via IMAP.]
Figure 5.9: Typical configurations for electronic mail
• The mail user agent (MUA) handles the dialogue with the user and accepts a new electronic message.
The message is then handed over either to the local MTA or to a relay MTA (see below).
• A mail transfer agent (MTA) is responsible for forwarding electronic messages through the Internet.
A relay MTA is an intermediate system that serves solely to forward messages within the Internet.
A gateway MTA is an intermediate system that forwards messages into other networks
(e.g. ISO/OSI networks based on the X.400 protocol).
• Once it has arrived at the destination system, electronic mail is stored in a user-specific buffer (mailbox).
This buffer is accessed either through the file system or by means of
special protocols such as IMAP (see next chapter).
• In the transfer of electronic mail, one distinguishes, by analogy with traditional postal mail, between an
envelope, which is relevant for the forwarding and delivery of a message, a header,
which describes general parameters of a message, and the actual content (body).
• The body is initially not structured any further and is restricted to 7-bit US-ASCII characters. Additional
standards describe how the body can be structured and, in particular, how multipart documents with different
document types are realized (MIME).
5.5.2 Commands and Replies
Electronic mail is transferred using the Simple Mail Transfer Protocol (SMTP) [70]. The
protocol defines a relatively small set of commands which are issued by an SMTP client and executed by the SMTP
server. The execution is acknowledged by numeric reply codes.
HELO   Identification of an SMTP client to an SMTP server (HELLO)
EHLO   Extended identification with indication of the supported SMTP options (EXTENDED HELLO)
MAIL   Start of a message transfer (MAIL)
RCPT   Identification of a recipient (RECIPIENT)
DATA   Start of the transfer of the content (DATA)
RSET   Abort of a transaction in progress (RESET)
VRFY   Verification of an address (VERIFY)
EXPN   Expansion of a mailing list (EXPAND)
HELP   Help on the available commands (HELP)
NOOP   Command without effect (NOOP)
QUIT   Termination of the transport connection (QUIT)
The precise syntax of the SMTP commands is defined in ABNF:
;
; Syntax of the SMTP commands:
;
helo = "HELO" SP Domain CRLF
ehlo = "EHLO" SP Domain CRLF
mail = "MAIL FROM:" ("<>" / Reverse-Path) [SP Mail-Parameters] CRLF
rcpt = "RCPT TO:" ("<Postmaster@" domain ">" / "<Postmaster>" /
       Forward-Path) [SP Rcpt-Parameters] CRLF
data = "DATA" CRLF
rset = "RSET" CRLF
vrfy = "VRFY" SP String CRLF
expn = "EXPN" SP String CRLF
help = "HELP" [ SP String ] CRLF
noop = "NOOP" [ SP String ] CRLF
quit = "QUIT" CRLF
;
; Syntax of the SMTP command parameters:
;
Reverse-path    = Path
Forward-path    = Path
Path            = "<" [ A-d-l ":" ] Mailbox ">"
A-d-l           = At-domain *( "," A-d-l )
                  ; Note that this form, the so-called "source route",
                  ; MUST BE accepted, SHOULD NOT be generated, and SHOULD be
                  ; ignored.
At-domain       = "@" domain
Mail-parameters = esmtp-param *(SP esmtp-param)
Rcpt-parameters = esmtp-param *(SP esmtp-param)
esmtp-param     = esmtp-keyword ["=" esmtp-value]
esmtp-keyword   = (ALPHA / DIGIT) *(ALPHA / DIGIT / "-")
esmtp-value     = 1*(%d33-60 / %d62-127)
                  ; any CHAR excluding "=", SP, and control characters
Keyword         = Ldh-str
Argument        = Atom
Domain          = (sub-domain 1*("." sub-domain)) / address-literal
sub-domain      = Let-dig [Ldh-str]
address-literal = "[" IPv4-address-literal /
                  IPv6-address-literal /
                  General-address-literal "]"
Mailbox         = Local-part "@" Domain
Local-part      = Dot-string / Quoted-string
Dot-string      = Atom *("." Atom)
Atom            = 1*atext
Quoted-string   = DQUOTE *qcontent DQUOTE
String          = Atom / Quoted-string
;
; Syntax for IPv4/IPv6 addresses:
;
IPv4-address-literal    = Snum 3("." Snum)
IPv6-address-literal    = "IPv6:" IPv6-addr
General-address-literal = Standardized-tag ":" 1*dcontent
Standardized-tag        = Ldh-str
                          ; MUST be specified in a standards-track RFC
                          ; and registered with IANA
Snum        = 1*3DIGIT ; representing a decimal integer
                       ; value in the range 0 through 255
Let-dig     = ALPHA / DIGIT
Ldh-str     = *( ALPHA / DIGIT / "-" ) Let-dig
IPv6-addr   = IPv6-full / IPv6-comp / IPv6v4-full / IPv6v4-comp
IPv6-hex    = 1*4HEXDIG
IPv6-full   = IPv6-hex 7(":" IPv6-hex)
IPv6-comp   = [IPv6-hex *5(":" IPv6-hex)] "::" [IPv6-hex *5(":"
              IPv6-hex)]
              ; The "::" represents at least 2 16-bit groups of zeros
              ; No more than 6 groups in addition to the "::" may be
              ; present
IPv6v4-full = IPv6-hex 5(":" IPv6-hex) ":" IPv4-address-literal
IPv6v4-comp = [IPv6-hex *3(":" IPv6-hex)] "::"
              [IPv6-hex *3(":" IPv6-hex) ":"] IPv4-address-literal
              ; The "::" represents at least 2 16-bit groups of zeros
              ; No more than 4 groups in addition to the "::" and
              ; IPv4-address-literal may be present
In response to the commands, the server sends three-digit numeric reply codes with additional textual
explanations, which are however not significant for the protocol. The reply codes follow a fixed pattern
(theory of reply codes).
• The first digit of the three-digit reply code indicates what kind of reply code
it is:
  1yz Preliminary positive reply, where further information is needed to carry out the action
      (positive preliminary reply).
  2yz Final positive reply about the successful execution of an action (positive completion reply).
  3yz Intermediate positive reply, where further information is needed to complete an action
      (positive intermediate reply).
  4yz Transient negative reply, where the error is of a temporary nature and the command can be repeated
      (transient negative completion reply).
  5yz Final negative reply, where an automatic repetition of the command does not make sense
      (permanent negative completion reply).
• The second digit groups replies into specific categories:
  x0z Syntactic problems.
  x1z Informational replies and status information.
  x2z Replies referring to the transmission channel.
  x3z Not defined.
  x4z Not defined.
  x5z Status of the server in the context of the initiated actions.
• The third digit gives the precise meaning of the reply within the respective category.
Because of this three-level structure, an implementation can react reasonably sensibly even to new reply codes that it does not know. The reply codes defined by SMTP are listed below, grouped by function:
500 Syntax error, command unrecognized
501 Syntax error in parameters or arguments
502 Command not implemented (see section 4.2.4)
503 Bad sequence of commands
504 Command parameter not implemented

211 System status, or system help reply
214 Help message

220 <domain> Service ready
221 <domain> Service closing transmission channel
421 <domain> Service not available, closing transmission channel

250 Requested mail action okay, completed
251 User not local; will forward to <forward-path>
252 Cannot VRFY user, but will accept message and attempt delivery
450 Requested mail action not taken: mailbox unavailable
550 Requested action not taken: mailbox unavailable
451 Requested action aborted: error in processing
551 User not local; please try <forward-path>
452 Requested action not taken: insufficient system storage
552 Requested mail action aborted: exceeded storage allocation
553 Requested action not taken: mailbox name not allowed
354 Start mail input; end with <CRLF>.<CRLF>
554 Transaction failed (Or, in the case of a connection-opening
    response, "No SMTP service here")
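The digit scheme can be illustrated with a small sketch. The function name classify_reply and the short category labels are assumptions made for this illustration; they are not part of SMTP. The point is that a client can decide how to react from the first two digits alone, even if it has never seen the full code.

```python
# Meaning of the first and second digit of an SMTP reply code.
FIRST_DIGIT = {
    "1": "positive preliminary reply",
    "2": "positive completion reply",
    "3": "positive intermediate reply",
    "4": "transient negative completion reply",
    "5": "permanent negative completion reply",
}
SECOND_DIGIT = {
    "0": "syntax",
    "1": "information",
    "2": "transmission channel",
    "5": "mail system",
}


def classify_reply(line: str) -> tuple:
    """Return (kind, category, retry_later) for one SMTP reply line."""
    code = line[:3]
    if len(code) != 3 or not code.isdigit() or code[0] not in FIRST_DIGIT:
        raise ValueError(f"not an SMTP reply code: {line!r}")
    kind = FIRST_DIGIT[code[0]]
    category = SECOND_DIGIT.get(code[1], "unspecified")
    # Only 4yz replies signal that repeating the command later may succeed.
    return kind, category, code[0] == "4"


if __name__ == "__main__":
    for reply in ("250 OK", "452 Insufficient system storage", "599 unknown code"):
        print(reply, "->", classify_reply(reply))
```

The made-up code 599 is still classified as a permanent failure, which is the behaviour the three-level structure is meant to enable.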
5.5.3
Message Headers
The format of the message headers is specified in RFC 2822 [?]. The essential ABNF production looks as follows:
fields        =  *(trace *resent-field) *regular-field

resent-field  =  resent-date / resent-from / resent-sender
resent-field  =/ resent-to / resent-cc / resent-bcc / resent-msg-id

regular-field =  orig-date / from / sender / reply-to / to / cc / bcc
regular-field =/ message-id / in-reply-to / references / subject
regular-field =/ comments / keywords
• The trace fields are generated by MTAs while a message is being transferred. They support
debugging and the detection of loops. The resent-field elements are added when a user reintroduces a message into the mail system.
• The remaining fields have meanings that are presumably well known by now.
5.5.4
Multipurpose Internet Mail Extensions (MIME)
The Multipurpose Internet Mail Extensions (MIME) [?] are conventions for structuring the content of a message. In particular, they make it possible to transfer different document types (including binary formats).
Additional Message Header Fields
The MIME specifications define several new fields for the message header:
• The MIME-Version field defines the MIME version number (currently 1.0).
• The Content-Type field describes the media type of the message content. There are composite media types in which each contained document describes its own Content-Type.
• The Content-Transfer-Encoding field specifies how the data is encoded for transfer. This is
necessary because SMTP classically requires 7-bit ASCII, and it avoids ambiguities.
• The optional Content-Description field contains a short description. It is particularly useful
when it is not guaranteed that a recipient is able to render the document type.
Media Types
MIME distinguishes five primitive media types:
1. The media type text can be used for arbitrary text. The simplest subtype is plain. For the
media type text, the character set can be specified with the charset parameter.
2. The media type image can be used for arbitrary image formats. Typical subtypes are jpeg
or png.
3. The media type audio denotes a document that requires an audio channel for output.
4. The media type video denotes sequences of moving image data. A typical subtype is mpeg.
5. The media type application denotes data formats that are understood by specific applications. Typical subtypes are postscript or pdf.
In addition to the primitive media types, there are composite media types:
1. The media type multipart is used for documents that consist of parts which may each have a different media type.
2. The media type message can itself contain a message; however, the contained message must not
itself be of type message.
In composite content, the individual parts are separated by delimiter lines, each of which starts at the
beginning of a line with two hyphens (--). In addition, the boundary parameter of the Content-Type field
must define a delimiter string that follows the two hyphens (--).
The last delimiter line is additionally terminated by two hyphens (--).
From: Nathaniel Borenstein <nsb@bellcore.com>
To: Ned Freed <ned@innosoft.com>
Date: Sun, 21 Mar 1993 23:56:48 -0800 (PST)
Subject: Sample message
MIME-Version: 1.0
Content-type: multipart/mixed; boundary="simple boundary"

This is the preamble. It is to be ignored, though it
is a handy place for composition agents to include an
explanatory note to non-MIME conformant readers.

--simple boundary

This is implicitly typed plain US-ASCII text.
It does NOT end with a linebreak.
--simple boundary
Content-type: text/plain; charset=us-ascii

This is explicitly typed plain US-ASCII text.
It DOES end with a linebreak.

--simple boundary--

This is the epilogue.
It is also to be ignored.
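As a hedged illustration of how a receiver takes such a message apart, the following sketch uses the email package of the Python standard library. The shortened message text and the helper name list_parts are assumptions made for this example. It shows that the line break directly in front of a delimiter line belongs to the delimiter and not to the preceding part.

```python
from email import message_from_string

# Shortened version of the sample message; blank lines end header sections.
RAW_MESSAGE = "\n".join([
    "From: Nathaniel Borenstein <nsb@bellcore.com>",
    "MIME-Version: 1.0",
    'Content-type: multipart/mixed; boundary="simple boundary"',
    "",
    "This is the preamble.",
    "",
    "--simple boundary",
    "",
    "This is implicitly typed plain US-ASCII text.",
    "It does NOT end with a linebreak.",
    "--simple boundary",
    "Content-type: text/plain; charset=us-ascii",
    "",
    "This is explicitly typed plain US-ASCII text.",
    "It DOES end with a linebreak.",
    "",
    "--simple boundary--",
    "",
    "This is the epilogue.",
])


def list_parts(raw: str) -> list:
    """Return (content type, body) for each part of a message."""
    message = message_from_string(raw)
    if not message.is_multipart():
        return [(message.get_content_type(), message.get_payload())]
    return [(part.get_content_type(), part.get_payload())
            for part in message.get_payload()]


if __name__ == "__main__":
    for content_type, body in list_parts(RAW_MESSAGE):
        print(content_type, repr(body))
```

The first part has no header of its own and therefore falls back to the default type text/plain.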
Base64 Encoding
With this encoding, each group of three bytes (24 bits) is represented by four characters taken from a
64-character alphabet, so that each character carries 6 bits. The resulting character sequence is wrapped so that no line is longer
than 76 characters.
Value Encoding   Value Encoding   Value Encoding   Value Encoding
    0 A             17 R             34 i             51 z
    1 B             18 S             35 j             52 0
    2 C             19 T             36 k             53 1
    3 D             20 U             37 l             54 2
    4 E             21 V             38 m             55 3
    5 F             22 W             39 n             56 4
    6 G             23 X             40 o             57 5
    7 H             24 Y             41 p             58 6
    8 I             25 Z             42 q             59 7
    9 J             26 a             43 r             60 8
   10 K             27 b             44 s             61 9
   11 L             28 c             45 t             62 +
   12 M             29 d             46 u             63 /
   13 N             30 e             47 v
   14 O             31 f             48 w          (pad) =
   15 P             32 g             49 x
   16 Q             33 h             50 y
If fewer than three bytes remain at the end of the input, padding bytes are appended and the
special character = is added to the encoding to indicate the number of padding bytes used (one = for one padding byte, == for two).
Obviously, a text is no longer readable without additional tools after Base64 encoding.
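The procedure can be sketched in a few lines. This is an illustrative implementation written for these notes; the function name base64_encode and the use of CRLF as line separator are assumptions, and production code would normally use a library routine instead.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def base64_encode(data: bytes, line_length: int = 76) -> str:
    """Encode bytes as Base64 text, wrapped at line_length characters."""
    characters = []
    for start in range(0, len(data), 3):
        chunk = data[start:start + 3]
        padding = 3 - len(chunk)                  # 0, 1 or 2 missing bytes
        value = int.from_bytes(chunk + b"\x00" * padding, "big")
        # Cut the 24-bit value into four 6-bit indexes into the alphabet.
        group = [ALPHABET[(value >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        if padding:
            group[-padding:] = "=" * padding      # mark the padding
        characters.extend(group)
    text = "".join(characters)
    lines = [text[i:i + line_length] for i in range(0, len(text), line_length)]
    return "\r\n".join(lines)


if __name__ == "__main__":
    for sample in (b"Man", b"Ma", b"M"):
        print(sample, "->", base64_encode(sample))
```

The three sample inputs show the cases without padding, with one padding byte and with two padding bytes.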
Quoted-Printable Encoding
Quoted-printable encoding tries to preserve as much of the readable text as possible. Only characters that cannot be represented in printable US-ASCII are specially encoded. The basic principle is to represent such a character by the hexadecimal value of its byte, prefixed by the character = as a marker
(for example =0A=0D).
The exact rules are in fact relatively complex. See RFC 2045 [?] for the details.
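The basic principle can be sketched as follows. This is a deliberately simplified illustration: the function name qp_encode_simple is an assumption, and the sketch ignores the rules of RFC 2045 for line breaks, trailing white space and the maximum line length.

```python
def qp_encode_simple(data: bytes) -> str:
    """Encode bytes with the basic quoted-printable principle only."""
    pieces = []
    for byte in data:
        printable = (33 <= byte <= 126 and byte != ord("=")) or byte == ord(" ")
        if printable:
            pieces.append(chr(byte))          # readable text stays as it is
        else:
            pieces.append(f"={byte:02X}")     # "=" followed by two hex digits
    return "".join(pieces)


if __name__ == "__main__":
    print(qp_encode_simple("Jürgen Schönwälder".encode("latin-1")))
```

The character = itself must also be encoded (as =3D), because it serves as the marker.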
5.6
Internet Message Access Protocol (IMAP)
The Internet Message Access Protocol (IMAP) [?] provides access to messages that are physically stored on a server.
Several mailboxes (folders) can be managed, and offline operation with later synchronization is also
possible. (IMAP also allows access to Usenet news by treating newsgroups as additional mailboxes.)
The following functions are supported in detail:
• Creating, deleting, and renaming mailboxes.
• Checking for newly arrived mail.
• Deleting mail.
• Setting and clearing flags on messages.
• Searching for specific messages.
• Selective access to attributes and contents of parts of a message.
IMAP usually runs over TCP and uses port number 143 by default. Because authentication was originally done by transmitting passwords in clear text, IMAP is often used over TLS or SSL. Interactions between a client and a server are generally line-oriented, with each line terminated by
a CRLF.
5.6.1
Identification and States
Messages are managed in different mailboxes. Mailboxes are identified by names, and a hierarchical name space is supported.
The messages in a mailbox are identified by numbers:
• The message sequence number is the relative position of a message in a
mailbox (counting starts at 1). The sequence number of a message can change as messages are deleted and
inserted.
• The unique identifier identifies a message in a mailbox independently of the position of the message in the mailbox. Unique identifiers persist
across IMAP sessions and therefore allow synchronization in offline operation.
Naturally, problems can also arise when unique identifiers are used, in particular when external programs reorganize a mailbox or when entire mailboxes are deleted and new ones with the same name are created. Therefore an additional global counter (the unique identifier validity value) exists which is incremented on such events. It signals that the current unique identifiers no longer have to be identical to previously stored
unique identifiers.
5.6.2
States
The IMAP protocol distinguishes several states. The commands available to a client depend on the
current state.
Figure 5.10: IMAP state diagram (states: connection established and server greeting, non-authenticated, authenticated, selected, logout and connection release)
• The initial state connection established and server greeting is entered once the transport connection has been
set up. In this state the server sends a greeting message.
• Normally a transition into the state non-authenticated follows. In this state,
essentially only the commands needed for authentication are permitted.
• After successful authentication, a transition into the state authenticated takes place. In this state,
a client can essentially select a mailbox for further processing.
• In the state selected, the content of the selected mailbox can be accessed and
modified. The client can release the selected mailbox again (transition
into the state authenticated) or end the IMAP session.
• In the state logout and connection release, the session is ended and the transport connection is closed in an orderly manner.
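The transitions described above can be written down as a small table. The following sketch is a simplified model made for illustration only: the names TRANSITIONS and next_state are assumptions, the greeting state is omitted, and a failed command is assumed to leave the state unchanged, which the real protocol does not guarantee in every case.

```python
# (current state, successfully completed command) -> next state
TRANSITIONS = {
    ("non-authenticated", "LOGIN"): "authenticated",
    ("non-authenticated", "AUTHENTICATE"): "authenticated",
    ("authenticated", "SELECT"): "selected",
    ("authenticated", "EXAMINE"): "selected",
    ("selected", "CLOSE"): "authenticated",
}


def next_state(state: str, command: str, ok: bool = True) -> str:
    """Return the session state after command has been processed in state."""
    command = command.upper()
    if state == "logout":
        raise ValueError("the session has already ended")
    if command == "LOGOUT":
        return "logout"               # possible in every state
    if not ok:
        return state                  # modelling assumption, see lead-in
    return TRANSITIONS.get((state, command), state)


if __name__ == "__main__":
    state = "non-authenticated"
    for command in ("LOGIN", "SELECT", "CLOSE", "LOGOUT"):
        state = next_state(state, command)
        print(command, "->", state)
```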
5.6.3
Commands
Each IMAP command is permitted only in certain states. The following list of
commands is therefore ordered by state:
Any state
CAPABILITY    Returns a list of the server's capabilities
NOOP          Empty command (can be used for status updates)
LOGOUT        Ends the IMAP session

State non-authenticated
AUTHENTICATE  Selects an authentication mechanism
LOGIN         Trivial authentication with a clear-text password

State authenticated
SELECT        Selects a mailbox (read-write)
EXAMINE       Selects a mailbox (read-only)
CREATE        Creates a new mailbox
DELETE        Deletes a mailbox
RENAME        Renames a mailbox
SUBSCRIBE     Adds a mailbox to the list of active mailboxes
UNSUBSCRIBE   Removes a mailbox from the list of active mailboxes
LIST          Lists the names of the mailboxes
LSUB          Lists the names of the active mailboxes
STATUS        Queries the status of a mailbox
APPEND        Appends data to a mailbox

State selected
CHECK         Creates a backup copy (checkpoint) of the mailbox
CLOSE         Closes the current mailbox
EXPUNGE       Deletes all messages that are marked for deletion
SEARCH        Searches for messages that match given criteria
FETCH         Reads data of a message from the mailbox
STORE         Changes data of a message in the mailbox
COPY          Copies messages to the end of a mailbox
UID           Executes COPY, FETCH, STORE, or SEARCH with unique identifiers instead of message sequence numbers

5.6.4
Tagging
IMAP supports concurrent operations on the server. A client can therefore issue several commands, which
the server then executes asynchronously. For this reason a client labels its commands with unique tags so that
it can later match responses to the corresponding commands. The syntax of a command is thus:
command = tag SPACE (command_any / command_auth /
          command_nonauth / command_select) CRLF
The server answers commands with responses. Three kinds of response lines are distinguished:
1. Response lines that request further information for processing a command are marked with
a +.
2. Response lines that carry a status message and do not imply the end of command processing
are marked with a *.
3. All other responses start with the tag set by the client and indicate the (successful or unsuccessful)
completion of a command.
The successful or unsuccessful completion is reported by keywords (OK, NO, BAD) and
not, as in SMTP, by structured reply codes.
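The three kinds of response lines can be told apart by looking at the first token of a line. The sketch below is an illustration only; the function name classify_response and the tags and response texts in the demonstration are made up for this example.

```python
def classify_response(line: str) -> tuple:
    """Return (kind, tag, status) for one IMAP response line."""
    head, _, rest = line.rstrip("\r\n").partition(" ")
    if head == "+":
        return "continuation", None, None     # server asks for more input
    if head == "*":
        return "untagged", None, None         # status data, command not finished
    status = rest.split(" ", 1)[0].upper()
    if status not in ("OK", "NO", "BAD"):
        raise ValueError(f"unexpected tagged response: {line!r}")
    return "tagged", head, status             # completes the command with this tag


if __name__ == "__main__":
    pending = {"a001": "LOGIN", "a002": "SELECT"}     # hypothetical tags
    for line in ("* 18 EXISTS", "a002 OK SELECT completed", "a001 OK LOGIN completed"):
        kind, tag, status = classify_response(line)
        if kind == "tagged":
            print(pending.pop(tag), "finished with", status)
```

Because the responses carry the tags, the client can handle completions in whatever order the server produces them.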
5.6.5
Message Format
The IMAP message format is again described in ABNF. An excerpt of the ABNF definitions for
the most basic elements is given here. For the rather extensive complete version, see RFC
2060 [?].
tag             = 1*<any ATOM_CHAR except "+">
;
; top-level productions for IMAP commands:
;
command         = tag SPACE (command_any / command_auth /
                  command_nonauth / command_select) CRLF
                  ;; Modal based on state
command_any     = "CAPABILITY" / "LOGOUT" / "NOOP" / x_command
                  ;; Valid in all states
command_auth    = append / create / delete / examine / list / lsub /
                  rename / select / status / subscribe / unsubscribe
                  ;; Valid only in Authenticated or Selected state
command_nonauth = login / authenticate
                  ;; Valid only when in Non-Authenticated state
command_select  = "CHECK" / "CLOSE" / "EXPUNGE" /
                  copy / fetch / store / uid / search
                  ;; Valid only when in Selected state
;
; top-level productions for IMAP responses:
;
response        = *(continue_req / response_data) response_done
continue_req    = "+" SPACE (resp_text / base64)
response_data   = "*" SPACE (resp_cond_state / resp_cond_bye /
                  mailbox_data / message_data / capability_data) CRLF
response_done   = response_tagged / response_fatal
response_fatal  = "*" SPACE resp_cond_bye CRLF
                  ;; Server closes connection immediately
response_tagged = tag SPACE resp_cond_state CRLF
5.7
File Transfer Protocol (FTP)
The File Transfer Protocol (FTP), defined in RFC 959 [71], is one of the oldest and most elementary protocols of the Internet.
It can be used to transfer files between nodes connected to the Internet. The protocol uses a separate TCP connection
for each data transfer. This approach sidesteps the problem of marking the end of a file. On the other hand, using
separate TCP connections is not very efficient for small files, because of the connection establishment and tear-down overhead
and because flow control and congestion control state must be built up anew for each connection.
Figure 5.11: FTP interaction model (components shown: user interface, control processes, data transfer processes, and file systems of the FTP client and the FTP server; the control processes exchange commands and replies, the data transfer processes exchange data)
Remarks:
• The control connection uses a text-based line-oriented protocol which is similar to SMTP. The client sends commands which are processed by the server. The server sends responses using three-digit response codes.
• A separate TCP connection is established for each data transfer. The connection can be initiated either by the
client or by the server. If the data transfer connection is initiated by the server, then the client's port number must
be conveyed to the server first (PORT command), and the server connects from the well-known port number 20.
If the data transfer connection is initiated by the client, then the server first conveys the port number on which
it listens (reply to the PASV command). The well-known port number for the control connection is 21.
• FTP allows a client to resume a data transmission that did not complete by using a special restart mechanism.
• FTP can be used to initiate a data transfer between two remote systems. However, this feature of FTP can result
in some interesting security problems. For more details, consult RFC 2577 [72].
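As an illustration of how a client conveys its data port to the server, the following sketch builds the argument of the PORT command of RFC 959, which lists the four address bytes and the two port bytes as decimal numbers. The function name port_command and the example address are assumptions made for this sketch.

```python
def port_command(ipv4_address: str, port: int) -> str:
    """Build an FTP PORT command line (without CRLF) for address and port."""
    octets = ipv4_address.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError(f"not a dotted IPv4 address: {ipv4_address!r}")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    high, low = divmod(port, 256)         # the port is sent as two bytes
    return "PORT " + ",".join(octets + [str(high), str(low)])


if __name__ == "__main__":
    print(port_command("192.0.2.1", 50000))
```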
5.8
Hypertext Transfer Protocol (HTTP)
The Hypertext Transfer Protocol (HTTP) is defined in RFC 2616 [73] and is one of the core building blocks of the World
Wide Web. HTTP is a simple request/response protocol primarily used to exchange documents between clients (browsers)
and servers. HTTP runs on top of TCP and uses the well-known port number 80. HTTP utilizes MIME
conventions in order to distinguish different media types.
Documents are identified using Uniform Resource Identifiers (URIs) as defined in RFC 2396 [74]. HTTP provides a fixed
set of methods that can be applied to documents identified by a URI. The current version of HTTP supports the methods
shown in Table 5.3.
Method    Description
OPTIONS   Request information about the communication options available
GET       Retrieve whatever information is identified by a URI
HEAD      Retrieve only the meta-information which is identified by a URI
POST      Annotate an existing resource or pass data to a data-handling process
PUT       Store information under the supplied URI
DELETE    Delete the resource identified by the URI
TRACE     Application-layer loopback of request messages for testing purposes
CONNECT   Initiate a tunnel such as a TLS or SSL tunnel
Table 5.3: HTTP 1.1 methods
The principal structure of HTTP messages is described in the following simplified ABNF. A Message is either a
Request or a Response. The two differ syntactically only in the first line, which is either a Request-Line or a
Status-Line.
Message          =  Request / Response

Request          =  Request-Line *(message-header CRLF) CRLF [ message-body ]
Response         =  Status-Line *(message-header CRLF) CRLF [ message-body ]

Request-Line     =  Method SP Request-URI SP HTTP-Version CRLF
Status-Line      =  HTTP-Version SP Status-Code SP Reason-Phrase CRLF

Method           =  "OPTIONS" / "GET" / "HEAD" / "POST"
Method           =/ "PUT" / "DELETE" / "TRACE" / "CONNECT"
Method           =/ Extension-Method
Extension-Method =  token

Request-URI      =  "*" / absoluteURI / abs_path / authority
HTTP-Version     =  "HTTP" "/" 1*DIGIT "." 1*DIGIT
Status-Code      =  3DIGIT
Reason-Phrase    =  *<TEXT, excluding CR, LF>

message-header   =  field-name ":" [ field-value ]
field-name       =  token
field-value      =  *( field-content / LWS )
field-content    =  <the OCTETs making up the field-value
                    and consisting of either *TEXT or combinations
                    of token, separators, and quoted-string>
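The grammar above translates almost directly into a parser. The following sketch is a simplified illustration: the function name parse_message is an assumption, header lines folded with LWS and repeated header fields are not handled, and the host name in the demonstration is a placeholder.

```python
def parse_message(raw: bytes) -> tuple:
    """Split an HTTP message into (kind, start line fields, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")    # empty line ends the headers
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0].split(" ", 2)           # three fields in both cases
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    # A Status-Line starts with the HTTP version, a Request-Line with a method.
    kind = "response" if start_line[0].startswith("HTTP/") else "request"
    return kind, start_line, headers, body


if __name__ == "__main__":
    request = b"GET /index.html HTTP/1.1\r\nHost: www.example.org\r\n\r\n"
    print(parse_message(request))
```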
5.8.1
Persistent Connections and Pipelining
HTTP 1.1 supports persistent connections. A client can establish a connection to a server and use it to send multiple
Request messages. Earlier versions of HTTP allowed only a single Request/Response exchange per
connection, which is rather expensive. To use a connection for multiple requests, it must be possible to detect the
end of a message body (document). HTTP relies on the Content-Length header field for this purpose. If
the server does not know the length before it starts to send the response (typically the case for dynamic pages
that are constructed on the fly), then the server may choose to close the connection to indicate the end of the message
body.
HTTP 1.1 also allows clients to send multiple requests without waiting for each response (pipelining), which can significantly reduce latency. Web pages typically consist of an HTML document that links to many small icons and
other elements. Retrieving all these referenced elements over a single connection in a pipelined mode
significantly reduces the number of TCP round-trip message exchanges.
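The role of the Content-Length header field can be illustrated by separating several responses that arrive back to back on one connection. The sketch below is an illustration under simplifying assumptions: the function name split_responses is made up, all data is assumed to be available in one buffer, and a response without a length is taken to extend to the end of the data.

```python
def split_responses(stream: bytes) -> list:
    """Split concatenated HTTP responses into (status line, body) pairs."""
    responses = []
    while stream:
        head, separator, rest = stream.partition(b"\r\n\r\n")
        if not separator:
            raise ValueError("incomplete header section")
        lines = head.decode("iso-8859-1").split("\r\n")
        length = None
        for line in lines[1:]:
            name, _, value = line.partition(":")
            if name.strip().lower() == "content-length":
                length = int(value.strip())
        if length is None:
            # Without a length the body ends when the connection is closed.
            responses.append((lines[0], rest))
            break
        responses.append((lines[0], rest[:length]))
        stream = rest[length:]            # the next response starts here
    return responses


if __name__ == "__main__":
    data = (b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello"
            b"HTTP/1.1 200 OK\r\nContent-Length: 3\r\n\r\nbye")
    print(split_responses(data))
```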
5.8.2
Caching and Proxies
Probably the most interesting and also most complex part of HTTP is its support for proxies and caching. Proxies are
entities that sit between the client and the server and that basically relay requests and responses. The HTTP 1.1
specification describes how proxies are supposed to handle requests, and it defines message headers that a
client can use to learn about the proxies between the client and the server or to control how many proxies may exist in the path
between the client and the server.
Some proxies and clients also maintain caches in which copies of documents are stored in local storage space to speed up
future accesses to these documents. HTTP allows a client to interrogate the server to determine
whether a document has changed. Caching is a key to the efficient operation of the Web. HTTP allows servers
to control whether and how a page can be cached as well as its lifetime. Furthermore, browsers can force a request to
bypass caches and obtain a fresh copy from a server.
Note that not all problems related to HTTP proxies and caches have been solved. A good list of issues can be found in
RFC 3143 [75].
5.8.3
Negotiation
HTTP supports header fields that allow clients and servers to negotiate capabilities and preferences. There are two different mechanisms:
1. Server-driven negotiation begins with a request from a client (a browser). The client indicates a list of its preferences. The server then decides how best to respond to the request.
2. Client-driven negotiation requires two requests. The client first asks the server what is available and then decides
which concrete request to send to the server.
Negotiation can be used to select different document formats, different transfer encodings, different languages or different
character sets. In most cases, server-driven negotiation is used since it needs only a single request. The following is an
example of typical negotiation header lines:
Accept: text/xml, application/xml, text/html;q=0.9, text/plain;q=0.8
Accept-Language: de, en;q=0.5
Accept-Encoding: gzip, deflate, compress;q=0.9
Accept-Charset: ISO-8859-1, utf-8;q=0.66, *;q=0.33
The last line says that the client prefers the ISO-8859-1 character set (with the default preference 1), accepts UTF-8 with preference 0.66,
and accepts any other character set with preference 0.33.
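A server that performs server-driven negotiation has to order the client's preferences by their q values. The following sketch shows one possible way to do this; the function names parse_preferences and choose are assumptions, and the handling of the wildcard * is simplified.

```python
def parse_preferences(value: str) -> list:
    """Return (name, quality) pairs of an Accept-* header, best first."""
    preferences = []
    for item in value.split(","):
        name, *parameters = [field.strip() for field in item.split(";")]
        quality = 1.0                              # default when q is absent
        for parameter in parameters:
            key, _, number = parameter.partition("=")
            if key.strip().lower() == "q":
                quality = float(number)
        preferences.append((name, quality))
    # sorted() is stable, so entries with equal quality keep their order.
    return sorted(preferences, key=lambda pref: pref[1], reverse=True)


def choose(header: str, available: list):
    """Pick the available variant that the client prefers most, or None."""
    for name, quality in parse_preferences(header):
        if quality == 0:
            continue                               # q=0 means "not acceptable"
        if name == "*" and available:
            return available[0]
        for candidate in available:
            if candidate.lower() == name.lower():
                return candidate
    return None


if __name__ == "__main__":
    header = "ISO-8859-1, utf-8;q=0.66, *;q=0.33"
    print(parse_preferences(header))
    print(choose(header, ["utf-8", "koi8-r"]))
```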
5.8.4
Conditional Requests
HTTP allows a client to make a request conditional by including headers that state the conditions under which the
request should be honored. Conditional requests can be used to avoid unnecessary data transfers.
This can best be explained by looking at a concrete example. Suppose a client has cached a certain document and wants
to check whether the document in the cache needs to be updated. The client can send a request which includes the
following header line:
If-Modified-Since: Wed, 26 Nov 2003 22:21:08 GMT
The server checks whether the document was changed after the date indicated by the header line and only processes the
request if this is the case. Otherwise it answers with the status code 304 (Not Modified) and sends no message body.
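The server-side check can be sketched as follows. The function name status_for and the modification times used in the demonstration are assumptions; the sketch only covers the date comparison and expects a date with an explicit time zone such as GMT.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def status_for(if_modified_since: str, last_modified: datetime) -> int:
    """Return 200 if the document must be sent, 304 if the cached copy is current."""
    threshold = parsedate_to_datetime(if_modified_since)
    return 200 if last_modified > threshold else 304


if __name__ == "__main__":
    header = "Wed, 26 Nov 2003 22:21:08 GMT"
    unchanged = datetime(2003, 11, 26, 12, 0, tzinfo=timezone.utc)
    changed = datetime(2003, 11, 27, 8, 0, tzinfo=timezone.utc)
    print(status_for(header, unchanged), status_for(header, changed))
```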
5.8.5
Delta Encoding
Delta Encodings are defined in RFC 3229 [76] as a mechanism to request only the changes relative to a specific version.
The motivation behind this proposal is the observation that many documents only change slightly over time and that it is
often much more efficient to just retrieve the changes instead of retrieving the whole document.
5.8.6
HTTP as a Substrate
HTTP has been so successful that it is often used as a substrate for other application protocols. Examples are the Internet
Printing Protocol (IPP) or the Simple Object Access Protocol (SOAP). However, HTTP was not designed for this purpose
and there are some pitfalls. For those interested in the details, consult RFC 3205 [77].
Appendix A
Packet Capturing
Networks sometimes do not function as expected, and in such situations it is useful to capture the relevant frames or
packets for analysis. It is also often useful to count certain frames or packets in order to obtain usage statistics.
There are special programs such as tcpdump, ethereal or ngrep which can be used to capture and analyze packets
and frames. A performance-critical aspect is the interface between the network interface (hardware), the operating system
kernel and the user space programs. To achieve good performance, all the components have to play well together:
• Captured packets should be filtered or aggregated as early as possible.
• The copying of packets is an expensive operation and should be avoided wherever possible.
• The number of system calls and the associated context switches should be minimized.
An approach to address this problem is to discard unimportant frames / packets as early as possible, ideally before they
are copied from the device driver’s memory to the operating system kernel memory.
A.1
BSD Packet Filter (BPF)
The BSD Packet Filter (BPF) [78] is based on a simple but powerful control-flow machine which is implemented within
the Unix kernel. It allows frames and packets to be filtered efficiently without expensive copy operations. The BPF
is programmed by user space applications which translate human-readable filter expressions into a sequence of BPF
instructions and install them in the kernel. Packets which match the BPF filter are first collected within the kernel and
then moved to the user space application in batches in order to reduce the number and cost of the required system calls.
The BPF machine has the following components:
• An accumulator for all calculations.
• An index register (x) which allows data to be accessed relative to a certain position.
• Memory for storing intermediate results.
• All registers and memory locations are 32 bits wide.
The instruction set of the BPF machine uses a fixed format which can be interpreted efficiently. Below are some example
BPF programs. For more details, see [78].
Example 1 Select all Ethernet frames which contain IPv4 packets (ip):
(000) ldh [12]                   ; load ethernet type field
(001) jeq #0x800    jt 2  jf 3   ; compare with 0x800
(002) ret #96                    ; return snaplen
(003) ret #0                     ; filter failed
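To make the execution model more concrete, the following sketch interprets such a program in user space. It is an illustration only: the tuple representation of instructions and the names run_filter and IP_FILTER are assumptions, and only the instructions ldh, ld, and, jeq and ret are supported. Jump targets are absolute instruction numbers, as in the listing above.

```python
# Each instruction is (opcode, operand, jump if true, jump if false).
IP_FILTER = [
    ("ldh", 12, None, None),      # load ethernet type field
    ("jeq", 0x800, 2, 3),         # compare with 0x800
    ("ret", 96, None, None),      # return snaplen
    ("ret", 0, None, None),       # filter failed
]


def run_filter(program: list, packet: bytes) -> int:
    """Run a filter program; return the number of bytes to capture (0 = drop)."""
    accumulator = 0
    pc = 0
    while True:
        opcode, operand, jump_true, jump_false = program[pc]
        pc += 1
        if opcode in ("ldh", "ld"):
            size = 2 if opcode == "ldh" else 4
            if operand + size > len(packet):
                return 0          # a load beyond the packet rejects the packet
            accumulator = int.from_bytes(packet[operand:operand + size], "big")
        elif opcode == "and":
            accumulator &= operand
        elif opcode == "jeq":
            pc = jump_true if accumulator == operand else jump_false
        elif opcode == "ret":
            return operand
        else:
            raise ValueError(f"unsupported instruction: {opcode}")


if __name__ == "__main__":
    ipv4_frame = bytes(12) + b"\x08\x00" + bytes(20)   # type field 0x0800
    arp_frame = bytes(12) + b"\x08\x06" + bytes(28)    # type field 0x0806
    print(run_filter(IP_FILTER, ipv4_frame), run_filter(IP_FILTER, arp_frame))
```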
Figure A.1: BPF within the Unix kernel (diagram labels: user, kernel, buffer, filter, protocol stack, BPF, link-level driver, network)
Example 2 Select all Ethernet frames which contain IPv4 packets that do not originate from the networks
128.3.112/24 and 128.3.254/24 (ip and not src net 128.3.112/24 and not src net 128.3.254/24):
(000) ldh [12]                       ; load ethernet type field
(001) jeq #0x800       jt 2  jf 7    ; compare with 0x800
(002) ld  [26]                       ; load ipv4 address field
(003) and #0xffffff00                ; mask the network part
(004) jeq #0x80037000  jt 7  jf 5    ; compare with 128.3.112.0
(005) jeq #0x8003fe00  jt 7  jf 6    ; compare with 128.3.254.0
(006) ret #96                        ; return snaplen
(007) ret #0                         ; filter failed
A.2
libpcap
The libpcap¹ C library provides the following functionality:
• A portable API that hides the differences of packet filter implementations in different operating systems.
• A compiler which translates human readable filter expressions into BPF programs.
• An interpreter for BPF programs which can be used to filter (previously captured) packets in user space.
• Functions for writing captured packets to files and for reading previously captured packets from files.
The usage of the libpcap API can best be illustrated by an example program. The following C source code implements
a program which opens a file containing previously captured packets, optionally installs a filter and then processes the
filtered packets in a callback function.
A more detailed description of the libpcap API can be found in the pcap(3) manual page. The syntax of the
supported filter expression is documented in the tcpdump(1) manual page.
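Independent of the C API, the files written by libpcap have a simple structure that can be read with a few lines of code. The following sketch is not the example program referred to above and does not use libpcap. It assumes the classic savefile layout, a 24-byte file header followed by one 16-byte record header per packet, and the function name read_pcap is made up for this illustration.

```python
import struct


def read_pcap(data: bytes):
    """Yield (seconds, microseconds, original length, captured bytes)."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        order = "<"               # file written on a little-endian host
    elif magic == b"\xa1\xb2\xc3\xd4":
        order = ">"               # file written on a big-endian host
    else:
        raise ValueError("not a classic pcap savefile")
    offset = 24                   # skip the file header
    while offset + 16 <= len(data):
        seconds, microseconds, captured, original = struct.unpack_from(
            order + "IIII", data, offset)
        offset += 16
        yield seconds, microseconds, original, data[offset:offset + captured]
        offset += captured


if __name__ == "__main__":
    # Build a tiny file in memory: one packet of 60 bytes, 3 bytes captured.
    header = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 96, 1)
    record = struct.pack("<IIII", 1, 500, 3, 60) + b"abc"
    print(list(read_pcap(header + record)))
```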
A.3
jpcap
Programmers who prefer the Java language can use the Java package jpcap², which allows captured packets to be processed
with Java programs. The jpcap package is implemented as a Java wrapper around the libpcap API.
¹ http://www.tcpdump.org/
² http://www.sf.net/projects/jpcap
The C program from the previous section can be written in Java as follows:
For a more detailed description of the jpcap API, see the jpcap Java documentation. Before using the jpcap API,
please check whether Java is fast enough for processing data rates from high speed networks.
Appendix B
Sockets
The socket application programming interface (API) was developed at the University of California in Berkeley as part
of the work on the BSD Unix system. The socket interface is a generic interface for interprocess communication using
message passing. Sockets are abstract communication endpoints with a rather small number of associated function calls.
The socket API distinguishes different types of sockets:
• A stream socket (SOCK_STREAM) is a bidirectional reliable communication endpoint. Data written to a local
stream socket can be read from a remote stream socket without having to worry about transmission errors, fragmentation or reordering that might occur in the underlying network. While the order of the byte stream is not
changed, the data block boundaries are not preserved.
• A datagram socket (SOCK_DGRAM) is a bidirectional unreliable communication endpoint which allows datagrams to be exchanged. Datagrams sent over the local datagram socket may not be received by the remote datagram
socket, or they may be received multiple times. Furthermore, the ordering of the datagrams can change during the
transmission. Note that datagram boundaries are preserved.
• A raw socket (SOCK_RAW) is a communication endpoint which allows network or interface
layer datagrams to be sent and received.
• A reliably delivered message socket (SOCK_RDM) is similar to a datagram socket but additionally provides reliable
datagram delivery.
• A sequenced packet socket (SOCK_SEQPACKET) is similar to a stream socket but retains data block boundaries.
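The difference between stream and datagram sockets with respect to data block boundaries can be observed with a pair of connected local sockets. The sketch below uses the socket module of Python, which mirrors the socket API, and assumes a Unix system. The function names are made up for this example. Note that local datagram sockets do not lose or reorder datagrams in practice, unlike datagram sockets used across a network.

```python
import socket


def stream_demo() -> bytes:
    """Write two blocks to a stream socket and read everything back."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with sender, receiver:
        sender.sendall(b"ab")
        sender.sendall(b"cd")
        sender.shutdown(socket.SHUT_WR)       # signal the end of the byte stream
        received = b""
        while chunk := receiver.recv(1024):   # chunk sizes are not predictable
            received += chunk
    return received


def datagram_demo() -> list:
    """Send two datagrams and receive them one by one."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    with sender, receiver:
        sender.send(b"ab")
        sender.send(b"cd")
        # Each receive call returns exactly one datagram.
        return [receiver.recv(1024), receiver.recv(1024)]


if __name__ == "__main__":
    print(stream_demo(), datagram_demo())
```

The stream side only guarantees the byte sequence, so the reader has to loop until the end of the stream, while the datagram side returns the two blocks separately.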
B.1
Socket Addresses
It is necessary to assign a name (an address) to a communication endpoint before it can be used. The socket API supports
different name spaces with different address formats. The generic data structure for addresses (struct sockaddr) is
defined as follows:
#include <sys/socket.h>

struct sockaddr {
    uint8_t     sa_len;          /* address length (BSD) */
    sa_family_t sa_family;       /* address family */
    char        sa_data[...];    /* data of some size */
};

struct sockaddr_storage {
    uint8_t     ss_len;          /* address length (BSD) */
    sa_family_t ss_family;       /* address family */
    char        padding[...];    /* padding of some size */
};
115
116
APPENDIX B. SOCKETS
Newer BSD systems support the sa_len field in the generic and the specific socket addresses, which was not present
in the older socket API. Other systems usually do not have this sa_len member (although it is generally a good idea
to have this member). The currently most important name spaces are the name spaces for the Internet and a name space
for local communication:
IPv4 Socket Addresses
Sockets that represent IPv4 communication endpoints use the address family AF_INET and the protocol family PF_INET.
IPv4 transport addresses are represented by the structure struct sockaddr_in:
#include <sys/socket.h>

typedef ... sa_family_t;

#include <netinet/in.h>

typedef ... in_port_t;

struct in_addr {
    uint8_t s_addr[4];            /* IPv4 address */
};

struct sockaddr_in {
    uint8_t        sin_len;       /* address length (BSD) */
    sa_family_t    sin_family;    /* address family */
    in_port_t      sin_port;      /* transport layer port */
    struct in_addr sin_addr;      /* IPv4 address */
};
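As an illustration, an IPv4 transport address can be filled in as sketched below. The helper name make_ipv4_addr() is hypothetical. The sketch relies on inet_pton() and htons() to store the address and the port in network byte order, and it never touches the BSD-specific sin_len member directly.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Fill *sa with the IPv4 transport address given as dotted-decimal text
 * and a port in host byte order. Returns 0 on success, -1 if the text
 * is not a valid IPv4 address. */
int make_ipv4_addr(struct sockaddr_in *sa, const char *text,
                   unsigned short port)
{
    memset(sa, 0, sizeof(*sa));     /* clears all members, incl. padding */
    sa->sin_family = AF_INET;
    sa->sin_port = htons(port);     /* port in network byte order */
    if (inet_pton(AF_INET, text, &sa->sin_addr) != 1) {
        return -1;
    }
    return 0;
}
```

A call such as make_ipv4_addr(&sa, "192.0.2.1", 80) yields a structure that can be passed, cast to struct sockaddr *, to bind(), connect() or sendto().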
IPv6 Socket Addresses
Sockets that represent IPv6 communication endpoints use the address family AF_INET6 and the protocol family
PF_INET6. IPv6 transport addresses are represented by the structure struct sockaddr_in6:
#include <sys/socket.h>

typedef ... sa_family_t;

#include <netinet/in.h>

typedef ... in_port_t;

struct in6_addr {
    uint8_t s6_addr[16];            /* IPv6 address */
};

struct sockaddr_in6 {
    uint8_t         sin6_len;       /* address length (BSD) */
    sa_family_t     sin6_family;    /* address family */
    in_port_t       sin6_port;      /* transport layer port */
    uint32_t        sin6_flowinfo;  /* flow information */
    struct in6_addr sin6_addr;      /* IPv6 address */
    uint32_t        sin6_scope_id;  /* scope identifier */
};
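Code that has to handle both Internet address families can store either address format in a struct sockaddr_storage. The following sketch shows one possible way to do this. The function name parse_ip_addr() is an assumption made for this example, and it only accepts numeric address text.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Parse numeric IPv4 or IPv6 address text into *ss and store the size of
 * the resulting specific address in *len. Returns 0 on success, -1 if the
 * text is neither a valid IPv4 nor a valid IPv6 address. */
int parse_ip_addr(const char *text, uint16_t port,
                  struct sockaddr_storage *ss, socklen_t *len)
{
    struct sockaddr_in *v4 = (struct sockaddr_in *) ss;
    struct sockaddr_in6 *v6 = (struct sockaddr_in6 *) ss;

    memset(ss, 0, sizeof(*ss));
    if (inet_pton(AF_INET, text, &v4->sin_addr) == 1) {
        v4->sin_family = AF_INET;
        v4->sin_port = htons(port);
        *len = sizeof(*v4);
        return 0;
    }

    memset(ss, 0, sizeof(*ss));
    if (inet_pton(AF_INET6, text, &v6->sin6_addr) == 1) {
        v6->sin6_family = AF_INET6;
        v6->sin6_port = htons(port);
        *len = sizeof(*v6);
        return 0;
    }
    return -1;
}
```

The address family stored in ss_family tells the caller which specific structure the storage currently holds.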
Local Socket Addresses
Sockets that represent local communication endpoints use the address family AF_LOCAL and the protocol family PF_LOCAL.
Local sockets are also widely known under the name Unix sockets with the address family AF_UNIX and the protocol
family PF_UNIX. Local socket addresses are represented by the structure struct sockaddr_un:
#include <sys/socket.h>

typedef ... sa_family_t;

#include <sys/un.h>

struct sockaddr_un {
    uint8_t     sun_len;          /* address length (BSD) */
    sa_family_t sun_family;       /* address family */
    char        sun_path[108];    /* path name (size is system specific, e.g., 108 on Linux) */
};

B.2 Communication Kinds
The socket API supports different kinds of communication in application programs. The two most important styles
are connection-less datagram communication and connection-oriented data stream communication.
[Figure: both processes call socket() and bind() and then exchange data using sendto() and recvfrom(); close() releases the socket.]
Figure B.1: Connection-less datagram communication
Figure B.1 shows how a server and a client make use of the socket primitives to provide and realize a connection-less
datagram application protocol. After creating and binding a local socket, the processes use the recvfrom() and the
sendto() primitives to receive and send datagrams.
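The sequence of primitives can be tried out within a single process over the loopback interface. The following is a rough sketch under several assumptions: the helper names bound_udp_socket() and udp_echo_once() are invented for this example, both sockets are bound to 127.0.0.1 on system-chosen ports, and the "server" simply echoes one datagram back to its sender.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* socket() + bind(): create a UDP socket bound to 127.0.0.1 on a port
 * chosen by the system; the resulting address is returned in *addr. */
static int bound_udp_socket(struct sockaddr_in *addr)
{
    socklen_t len = sizeof(*addr);
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd == -1) {
        return -1;
    }
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_port = htons(0);
    if (inet_pton(AF_INET, "127.0.0.1", &addr->sin_addr) != 1
        || bind(fd, (struct sockaddr *) addr, sizeof(*addr)) == -1
        || getsockname(fd, (struct sockaddr *) addr, &len) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Send msg from the client socket to the server socket, echo it back and
 * return the number of bytes the client received (or -1 on error). */
ssize_t udp_echo_once(const char *msg, char *reply, size_t replylen)
{
    struct sockaddr_in saddr, caddr, peer;
    socklen_t peerlen = sizeof(peer);
    char buf[512];
    ssize_t n, result = -1;
    int server = bound_udp_socket(&saddr);
    int client = bound_udp_socket(&caddr);

    if (server == -1 || client == -1) {
        goto done;
    }
    /* Client: sendto() the request to the server address. */
    if (sendto(client, msg, strlen(msg), 0,
               (struct sockaddr *) &saddr, sizeof(saddr)) == -1) {
        goto done;
    }
    /* Server: recvfrom() the request and sendto() it back to the sender. */
    n = recvfrom(server, buf, sizeof(buf), 0,
                 (struct sockaddr *) &peer, &peerlen);
    if (n == -1 || sendto(server, buf, (size_t) n, 0,
                          (struct sockaddr *) &peer, peerlen) == -1) {
        goto done;
    }
    /* Client: recvfrom() the reply. */
    result = recvfrom(client, reply, replylen, 0, NULL, NULL);

done:
    if (server != -1) close(server);
    if (client != -1) close(client);
    return result;
}
```

A real client and server would run in separate processes and would have to cope with lost datagrams, for example by using timeouts. The sketch ignores this because loopback delivery is normally reliable in practice.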
Figure B.2 shows how a server and a client make use of the socket primitives to provide and realize a connection-oriented
application protocol. The server creates a listening local socket which is used to accept incoming connections. Once
a connection has been accepted, a new local file descriptor is returned which can be used to read() or write()
data. The close() function is called to close the connection. On the client side, the connect() function is used
to connect the local socket to a remote (server) socket. When the connect() function returns successfully, normal
read() or write() functions can be used to exchange data. The close() function is again called to close the
connection.
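The connection-oriented call sequence can also be exercised within one process. The sketch below is an illustration only: the function name tcp_hello_once() and the message "hello\n" are made up, and the code relies on the common behaviour that connect() to a listening loopback socket completes before accept() is called.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Set up a listening socket on 127.0.0.1, connect to it, send a short
 * message from the accepted connection and read it on the client side.
 * Returns the number of bytes read by the client, or -1 on error. */
ssize_t tcp_hello_once(char *reply, size_t replylen)
{
    static const char msg[] = "hello\n";
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    size_t total = 0;
    ssize_t n, result = -1;
    int listener, client = -1, conn = -1;

    /* Server: socket(), bind(), listen(). */
    listener = socket(AF_INET, SOCK_STREAM, 0);
    if (listener == -1) {
        return -1;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(0);           /* let the system pick a port */
    if (inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr) != 1
        || bind(listener, (struct sockaddr *) &addr, sizeof(addr)) == -1
        || listen(listener, 1) == -1
        || getsockname(listener, (struct sockaddr *) &addr, &len) == -1) {
        goto done;
    }

    /* Client: socket(), connect(). */
    client = socket(AF_INET, SOCK_STREAM, 0);
    if (client == -1
        || connect(client, (struct sockaddr *) &addr, sizeof(addr)) == -1) {
        goto done;
    }

    /* Server: accept() returns a new descriptor; write() and close() it. */
    conn = accept(listener, NULL, NULL);
    if (conn == -1
        || write(conn, msg, sizeof(msg) - 1) != (ssize_t) (sizeof(msg) - 1)) {
        goto done;
    }
    close(conn);
    conn = -1;

    /* Client: read() until the buffer is full or the stream ends. */
    while (total < replylen
           && (n = read(client, reply + total, replylen - total)) > 0) {
        total += (size_t) n;
    }
    result = (ssize_t) total;

done:
    if (conn != -1) close(conn);
    if (client != -1) close(client);
    close(listener);
    return result;
}
```

The read loop reflects the byte stream semantics: a single read() may return only part of the data, so the client keeps reading until the server has closed the connection.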
[Figure: the server calls socket(), bind(), listen() and accept(); the client calls socket() and connect() (connection setup). Both sides then exchange data using read() and write() and call close() (connection release).]
Figure B.2: Connection-oriented data stream communication
B.3 Socket API Overview
The following definitions summarize the socket API. For a more detailed description, please consult the corresponding
Unix manual pages.
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

#define SOCK_STREAM    ...
#define SOCK_DGRAM     ...
#define SOCK_RAW       ...
#define SOCK_RDM       ...
#define SOCK_SEQPACKET ...

#define AF_LOCAL ...
#define AF_INET ...
#define AF_INET6 ...

#define PF_LOCAL ...
#define PF_INET ...
#define PF_INET6 ...

int socket(int domain, int type, int protocol);
int bind(int socket, struct sockaddr *addr,
         socklen_t addrlen);
int connect(int socket, struct sockaddr *addr,
            socklen_t addrlen);
int listen(int socket, int backlog);
int accept(int socket, struct sockaddr *addr,
socklen_t *addrlen);
ssize_t write(int socket, void *buf, size_t count);
int send(int socket, void *msg, size_t len, int flags);
int sendto(int socket, void *msg, size_t len, int flags,
struct sockaddr *addr, socklen_t addrlen);
ssize_t read(int socket, void *buf, size_t count);
int recv(int socket, void *buf, size_t len, int flags);
int recvfrom(int socket, void *buf, size_t len, int flags,
struct sockaddr *addr, socklen_t *addrlen);
int shutdown(int socket, int how);
int close(int socket);
int getsockopt(int socket, int level, int optname,
void *optval, socklen_t *optlen);
int setsockopt(int socket, int level, int optname,
void *optval, socklen_t optlen);
int getsockname(int socket, struct sockaddr *addr,
socklen_t *addrlen);
int getpeername(int socket, struct sockaddr *addr,
socklen_t *addrlen);
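The option functions are perhaps the least self-explanatory part of this summary. The following sketch shows how they are typically called. The helper names reusable_stream_socket() and socket_type() are hypothetical, and SO_REUSEADDR and SO_TYPE are just two commonly available socket-level options.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a stream socket with the SO_REUSEADDR option enabled.
 * Returns the descriptor or -1 on error. */
int reusable_stream_socket(void)
{
    int on = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd == -1) {
        return -1;
    }
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Return the type of a socket (SOCK_STREAM, SOCK_DGRAM, ...) or -1. */
int socket_type(int fd)
{
    int type;
    socklen_t len = sizeof(type);   /* in: buffer size, out: value size */

    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) == -1) {
        return -1;
    }
    return type;
}
```

Note that getsockopt() takes a pointer to the length while setsockopt() takes the length by value, mirroring the difference between accept() and bind().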
B.4 Name Resolution
Numeric addresses are usually hard for humans to memorize. It is thus useful to introduce more human-friendly symbolic
names. The Internet protocols use the Domain Name System (DNS) to map symbolic names to Internet addresses. Note
that DNS supports IPv4 as well as IPv6 addresses. Furthermore, there is usually also a locally defined mapping of
well-known port numbers to symbolic names (e.g., port number 80 has the well-known symbolic names http or www).
The name to address mapping is supported by the functions getaddrinfo() and getnameinfo() which are described below. Many older programs still use the functions gethostbyname() and gethostbyaddr(), which have
been deprecated.
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

#define AI_PASSIVE     ...
#define AI_CANONNAME   ...
#define AI_NUMERICHOST ...

struct addrinfo {
    int              ai_flags;
    int              ai_family;
    int              ai_socktype;
    int              ai_protocol;
    size_t           ai_addrlen;
    struct sockaddr *ai_addr;
    char            *ai_canonname;
    struct addrinfo *ai_next;
};

int getaddrinfo(const char *node,
                const char *service,
                const struct addrinfo *hints,
                struct addrinfo **res);
void freeaddrinfo(struct addrinfo *res);
const char *gai_strerror(int errcode);
The mapping of names to addresses is realized by the function getaddrinfo(). This function has three input parameters (node, service, hints) and returns, via the output parameter res, a pointer to a list of struct addrinfo elements. This list must be
released by calling freeaddrinfo() when it is no longer needed. In case of an error, getaddrinfo() returns a non-zero value
which can be passed to gai_strerror() in order to get a human readable error description.
One of the arguments node and service can be NULL thus requesting only a name resolution of the other element.
The name resolution process can be further controlled by passing some hints to the function. Hints can be used, for
example, to request addresses of a certain address family or socket type.
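A typical calling pattern is sketched below. The function name count_stream_addresses() is invented for this example. It requests stream socket addresses of any address family through the hints, walks the resulting list and releases it again.

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Resolve node and service to stream socket addresses and return the
 * number of list elements found, or -1 if the resolution failed. The
 * flags are passed through as hints (e.g., AI_NUMERICHOST). */
int count_stream_addresses(const char *node, const char *service, int flags)
{
    struct addrinfo hints, *res, *ai;
    int count = 0;
    int rc;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;        /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;    /* only stream sockets */
    hints.ai_flags = flags;

    rc = getaddrinfo(node, service, &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return -1;
    }
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        /* ai->ai_family, ai->ai_socktype and ai->ai_protocol can be
         * passed to socket(), ai->ai_addr and ai->ai_addrlen to
         * connect() or bind(). */
        count++;
    }
    freeaddrinfo(res);
    return count;
}
```

With the hint AI_NUMERICHOST the node must be a numeric address string, so no DNS lookup takes place. A client would normally try to connect to each list element in turn until one attempt succeeds.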
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

#define NI_NOFQDN       ...
#define NI_NUMERICHOST  ...
#define NI_NAMEREQD     ...
#define NI_NUMERICSERV  ...
#define NI_NUMERICSCOPE ...
#define NI_DGRAM        ...

int getnameinfo(const struct sockaddr *sa,
                socklen_t salen,
                char *host, size_t hostlen,
                char *serv, size_t servlen,
                int flags);
const char *gai_strerror(int errcode);
The inverse mapping of addresses to symbolic names is supported by the function getnameinfo(). The first two
parameters (sa, salen) are input parameters. The result of the mapping is a host name and a service name which
are written to the memory location host with the length hostlen and serv with the length servlen. Additional
flags can be passed to the mapping function in order to control the details of the mapping process.
Example 3 The following source code implements a client for a simple connection-oriented protocol which retrieves the
date and time from a server.
Example 4 The following source code implements a server for a simple connection-oriented protocol which provides
the date and time to clients.
Example 5 The following source code implements a client for a simple connection-less protocol which retrieves the date
and time from a server.
Example 6 The following source code implements a server for a simple connection-less protocol which provides the
date and time to clients.
B.5 Multiplexing
The examples discussed so far all have the property that the server or the client can block (freeze) in case of
communication errors. For example, the connection-oriented server blocks incoming requests until the current client has been
served. One approach to address this deficiency is to use threads, which of course requires thread-safe libraries. The
alternative is to avoid calling blocking functions by first checking whether a socket can be read or written.
#include <sys/select.h>

typedef ... fd_set;

void FD_ZERO(fd_set *set);
void FD_SET(int fd, fd_set *set);
void FD_CLR(int fd, fd_set *set);
int FD_ISSET(int fd, fd_set *set);

int select(int n, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);
int pselect(int n, fd_set *readfds, fd_set *writefds,
            fd_set *exceptfds, const struct timespec *timeout,
            const sigset_t *sigmask);
The functions select() and pselect() can be used to test whether socket descriptors are “ready” so that a subsequent socket library call does not block. The select() call distinguishes three sets of socket descriptors:
1. The set readfds contains descriptors which will be watched to see if a subsequent read operation will not block.
2. The set writefds contains descriptors which will be watched to see if a subsequent write operation will not
block.
3. The set exceptfds contains the descriptors which will be watched for exceptions.
The macros FD_ZERO(), FD_SET(), FD_CLR() and FD_ISSET() can be used to manipulate the sets. The timeout
is an upper bound on the amount of time elapsed before the select() function returns. The parameter n contains the
highest-numbered file descriptor in any of the three sets plus 1.
The function pselect() allows the correct handling of situations where a program wants to wait for socket descriptors
as well as software signals.
Example 7 The following source code combines the connection-less server with the connection-oriented server. The
main loop uses the select() function to wait for incoming requests.
Bibliography
[1] F. Halsall. Data Communications, Computer Networks and Open Systems. Addison-Wesley, 4 edition, 1996.
[2] A. S. Tanenbaum. Computer Networks. Prentice Hall, 4 edition, 2002.
[3] W. Stallings. Data and Computer Communications. Prentice Hall, 6 edition, 2000.
[4] D. E. Comer. Internetworking with TCP/IP: Principles, Protocols, and Architectures. Prentice Hall, 4 edition,
2000.
[5] C. Huitema. Routing in the Internet. Prentice Hall, 2 edition, 1999.
[6] W. R. Stevens. TCP/IP Illustrated, Volume 1: The Protocols. Addison Wesley, 1994.
[7] J. F. Kurose and K. W. Ross. Computer Networking: A Top-Down Approach Featuring the Internet. AddisonWesley, 3 edition, 2004.
[8] R. Hinden and S. Deering. Internet Protocol Version 6 (IPv6) Addressing Architecture. RFC 3513, Nokia, Cisco
Systems, April 2003.
[9] R. Elz. A Compact Representation of IPv6 Addresses. RFC 1924, University of Melbourne, April 1996.
[10] D. Waitzman. A Standard for the Transmission of IP Datagrams on Avian Carriers. RFC 1149, BBN STC, April
1990.
[11] P. Hoffman and S. Bradner. Defining the IETF. RFC 3233, Internet Mail Consortium, Harvard University, February
2002.
[12] S. Bradner. The Internet Standards Process – Revision 3. RFC 2026, Harvard University, October 1996.
[13] R. M. Metcalfe and D. R. Boggs. Ethernet: Distributed packet switching for local computer networks. Communications of the ACM, 19(5):395–404, July 1976.
[14] ANSI/IEEE. Local Area Networks: CSMA/CD, Std 802.3, 1988.
[15] B. Carpenter. Architectural Principles of the Internet. RFC 1958, IAB, June 1996.
[16] R. Bush and D. Meyer. Some Internet Architectural Guidelines and Philosophy. RFC 3439, December 2002.
[17] S. Deering and R. Hinden. Internet Protocol, Version 6 (IPv6) Specification. RFC 2460, Cisco, Nokia, December
1998.
[18] J. Postel. Internet Protocol. RFC 791, ISI, September 1981.
[19] IANA. Special-Use IPv4 Addresses. RFC 3330, Internet Assigned Numbers Authority, September 2002.
[20] Y. Rekhter, B. Moskowitz, D. Karrenberg, G. J. deGroot, and E. Lear. Address Allocation for Private Internets.
RFC 1918, Cisco Systems, Chrysler Corp., RIPE NCC, Silicon Graphics, Inc., February 1996.
[21] K. Nichols, S. Blake, F. Baker, and D. Black. Definition of the Differentiated Services Field (DS Field) in the IPv4
and IPv6 Headers. RFC 2474, Cisco Systems, Torrent Networking Technologies, EMC Corporation, December
1998.
[22] K. Ramakrishnan, S. Floyd, and D. Black. The Addition of Explicit Congestion Notification (ECN) to IP. RFC
3168, TeraOptic Networks, ACIRI, EMC, September 2001.
[23] D. Grossman. New Terminology and Clarifications for Diffserv. RFC 3260, Motorola, Inc., April 2002.
[24] F. Baker. Requirements for IP Version 4 Routers. RFC 1812, Cisco Systems, June 1995.
[25] G. Trotter. Terminology for Forwarding Information Base (FIB) based Router Performance. RFC 3222, Agilent
Technologies, December 2001.
123
124
BIBLIOGRAPHY
[26] M. A. Ruiz-Sánchez, E. W. Biersack, and W. Dabbous. Survey and Taxonomy of IP Address Lookup Algorithms.
IEEE Network, pages 8–23, March 2000.
[27] J. Postel. Internet Control Message Protocol. RFC 792, ISI, September 1981.
[28] C. Kent and J. Mogul. Fragmentation Considered Harmful. In Proc. SIGCOMM ’87 Workshop on Frontiers in
Computer Communications Technology, August 1987.
[29] J. Mogul and S. Deering. Path MTU Discovery. RFC 1191, DECWRL, Stanford University, November 1990.
[30] C. Hornig. A Standard for the Transmission of IP Datagrams over Ethernet Networks. RFC 894, Symbolics
Cambridge Research Center, April 1984.
[31] D. C. Plummer. An Ethernet Address Resolution Protocol. RFC 826, MIT, November 1982.
[32] R. Finlayson, T. Mann, J. Mogul, and M. Theimer. A Reverse Address Resolution Protocol. RFC 903, Stanford
University, June 1984.
[33] R. Droms. Dynamic Host Configuration Protocol. RFC 2131, Bucknell University, March 1997.
[34] S. Alexander and R. Droms. DHCP Options and BOOTP Vendor Extensions. RFC 2132, Silicon Graphics, Bucknell
University, March 1997.
[35] R. Droms and W. Arbaugh. Authentication for DHCP Messages. RFC 3118, Cisco Systems, University of Maryland, June 2001.
[36] R. Gilligan and E. Nordmark. Transition Mechanisms for IPv6 Hosts and Routers. RFC 2893, FreeGate Corp., Sun
Microsystems, August 2000.
[37] S. Kent and R. Atkinson. IP Authentication Header. RFC 2402, BBN Corporation, At Home Network, November
1998.
[38] S. Kent and R. Atkinson. IP Encapsulating Security Payload (ESP). RFC 2406, BBN Corporation, At Home
Network, November 1998.
[39] A. Conta and S. Deering. Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6)
Specification. RFC 2463, Lucent, Cisco Systems, December 1998.
[40] M. Crawford. Transmission of IPv6 Packets over Ethernet Networks. RFC 2464, Fermilab, December 1998.
[41] T. Narten, E. Nordmark, and W. Simpson. Neighbor Discovery for IP Version 6 (IPv6). RFC 2461, IBM, Sun
Microsystems, Daydreamer, December 1998.
[42] G. Malkin. RIP Version 2. RFC 2453, Bay Networks, November 1998.
[43] F. Baker and R. Atkinson. RIP-2 MD5 Authentication. RFC 2082, Cisco Systems, January 1997.
[44] J. Moy. OSPF Version 2. RFC 2328, Ascend Communications, April 1998.
[45] Y. Rekhter and T. Li. A Border Gateway Protocol 4 (BGP-4). RFC 1771, IBM, Cisco, March 1995.
[46] G. Huston. The BGP Routing Table. The Internet Journal, 4(1), March 2001.
[47] J. Postel. User Datagram Protocol. RFC 768, ISI, August 1980.
[48] J. Postel. Transmission Control Protocol. RFC 793, ISI, September 1981.
[49] M. Allman, V. Paxson, and W. Stevens. TCP Congestion Control. RFC 2581, NASA Glenn/Sterling Software,
ACIRI/ICSI, April 1999.
[50] M. Allman, S. Floyd, and C. Partridge. Increasing TCP’s Initial Window. RFC 3390, BBN/NASA GRC, ICIR,
BBN Technologies, October 2002.
[51] R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang, and V. Paxson. Stream Control Transmission Protocol. RFC 2960, Motorola, Cisco, Siemens, Nortel Networks, Ericsson,
Telcordia, UCLA, ACIRI, October 2000.
[52] L. Ong and J. Yoakum. An Introduction to the Stream Control Transmission Protocol (SCTP). RFC 3286, Ciena
Corporation, Nortel Networks, May 2002.
[53] J. Stone, R. Stewart, and D. Otis. Stream Control Transmission Protocol (SCTP) Checksum Change. RFC 3309,
Stanford, Cisco Systems, SANlight, September 2002.
[54] P. Mockapetris. Domain Names - Concepts and Facilities. RFC 1034, ISI, November 1987.
[55] P. Mockapetris. Domain Names - Implementation and Specification. RFC 1035, ISI, November 1987.
[56] D. Eastlake. Domain Name System Security Extensions. RFC 2535, IBM, March 1999.
BIBLIOGRAPHY
125
[57] P. Faltstrom, P. Hoffman, and A. Costello. Internationalizing Domain Names in Applications (IDNA). RFC 3490,
Cisco, IMC & VPNC, UC Berkeley, March 2003.
[58] P. Hoffman and M. Blanchet. Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN). RFC
3491, IMC & VPNC, Viagenie, March 2003.
[59] A. Costello. Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications
(IDNA). RFC 3492, UC Berkeley, March 2003.
[60] S. Thomson and C. Huitema. DNS Extensions to support IP version 6. RFC 1886, Bellcore, INRIA, December
1995.
[61] S. Thomson, C. Huitema, V. Ksinant, and M. Souissi. DNS Extensions to Support IP Version 6. RFC 3596, Cisco,
Microsoft, 6WIND, AFNIC, October 2003.
[62] ITU. Information technology - Abstract Syntax Notation One (ASN.1): Specification of basic notation. Recommendation ITU-T X.680, International Telecommunication Union, December 1997.
[63] ITU. Information technology - ASN.1 encoding rules: Specification of Basic Encoding Rules (BER), Canonical
Encoding Rules (CER) and Distinguished Encoding Rules (DER). Recommendation ITU-T X.690, International
Telecommunication Union, December 1997.
[64] D. Steedman. Abstract Syntax Notation One (ASN.1): The Tutorial and Reference. Technology Appraisals, 1990.
[65] M. Rose and K. McCloghrie. Structure and Identification of Management Information for TCP/IP-based Internets.
RFC 1155, Performance Systems International, Hughes LAN Systems, May 1990.
[66] J. Case, M. Fedor, M. Schoffstall, and J. Davin. A Simple Network Management Protocol. RFC 1157, SNMP
Research, PSI, MIT, May 1990.
[67] S. Legg. Generic String Encoding Rules (GSER) for ASN.1 Types. RFC 3641, Adacel Technologies, October
2003.
[68] D. Crocker and P. Overell. Augmented BNF for Syntax Specifications: ABNF. RFC 2234, Internet Mail Consortium, Demon Internet Ltd., November 1997.
[69] J. Postel. Simple Mail Transfer Protocol. RFC 821, ISI, August 1982.
[70] J. Klensin. Simple Mail Transfer Protocol. RFC 2821, AT&T Laboratories, April 2001.
[71] J. Postel and J. Reynolds. File Transfer Protocol (FTP). RFC 959, ISI, October 1985.
[72] M. Allman and S. Ostermann. FTP Security Considerations. RFC 2577, NASA Glenn/Sterling Software, Ohio
University, May 1999.
[73] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. Hypertext Transfer Protocol
– HTTP/1.1. RFC 2616, UC Irvine, Compaq/W3C, Compaq, W3C/MIT, Xerox, Microsoft, W3C/MIT, June 1999.
[74] T. Berners-Lee, R. Fielding, and L. Masinter. Uniform Resource Identifiers (URI): Generic Syntax. RFC 2396,
MIT/LCS, U.C. Irvine, Xerox Corporation, August 1998.
[75] I. Cooper and J. Dilley. Known HTTP Proxy/Caching Problems. RFC 3143, Equinix, Akamai Technologies, June
2001.
[76] J. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann, Y. Goland, A. van Hoff, and D. Hellerstein. Delta encoding
in HTTP. RFC 3229, Compaq WRL, AT&T, Univ. of Saarbruecken, Marimba, ERS/USDA, January 2002.
[77] K. Moore. On the use of HTTP as a Substrate. RFC 3205, University of Tennessee, February 2002.
[78] S. McCanne and V. Jacobson. The BSD Packet Filter: A New Architecture for User-level Packet Capture. In Proc.
Usenix Winter Conference, January 1993.
Download