Dissertation submitted for the degree of Doctor
of the Faculty of Applied Sciences of the
Albert-Ludwigs-Universität Freiburg
Algorithms and Data Structures
for IP Lookup, Packet
Classification and Conflict
Detection
Christine Maindorfer
Advisor: Prof. Dr. Thomas Ottmann
Dean of the Faculty of Applied Sciences:
Prof. Dr. Hans Zappe
Second referee: Prof. Dr. Susanne Albers
Date of disputation: March 2, 2009
Zusammenfassung
The principal task of an Internet router is to forward packets. To determine the
next router on the path to the destination, the header of each data packet, which
contains among other things the destination address, is inspected and matched
against a router table. If several prefixes in the router table match the destination
address, a strategy known as "Longest Prefix Matching" is usually applied: among
all possible actions, the one specified by the longest prefix matching the address
is selected. Numerous algorithms and data structures have been proposed to solve
this so-called IP lookup problem.
Changes in the network topology due to physical link failures or the addition of
new routers or links lead to updates in the router tables. Since the performance of
the IP lookup unit has a decisive influence on the overall performance of the
Internet, it is crucial that IP lookups as well as updates are carried out as fast
as possible. To accelerate these operations, router tables should be implemented
such that lookups and updates can be executed concurrently. To ensure that
search-tree based dynamic router tables do not degenerate under updates, they
are built on a class of balanced search trees. Relaxed balancing is a common
concept in the design of concurrently implemented search trees: balancing
operations may be postponed to times at which no search processes operate in the
same part of the tree.
The first part of this dissertation investigates the hypothesis that a relaxed
balancing scheme is better suited for dynamic router tables than a scheme that
uses strict balancing. To this end, we propose the relaxed balanced min-augmented
range tree and compare it with the strictly balanced variant in a benchmark using
real IPv4 routing data.
Furthermore, in order to carry out a plausibility consideration corroborating the
correctness of the different locking strategies, an interactive visualization of the
relaxed balanced min-augmented range tree is presented.
In addition, IP routers provide policy-based routing (PBR) mechanisms, which
complement the existing destination-based routing. Among other things, PBR makes
it possible to support Quality of Service (QoS) as well as network security
policies, so-called firewalls. To provide PBR, routers must inspect several packet
fields, such as the source and destination address, port, and protocol, in order
to classify packets into so-called "flows". This requires searching a given set of
predefined d-dimensional filters, where the number of packet fields to be
inspected corresponds to the dimension d. Geometrically speaking, filters are
represented by d-dimensional hyperrectangles and packets by d-dimensional points.
Packet classification then means finding, for a packet to be forwarded, the best
matching hyperrectangle that contains the point.
The R-tree is a multidimensional index structure for the dynamic management of
spatial data that supports point and containment queries. The R-tree and its
variants have not yet been examined for their suitability for the packet
classification problem. In the second part we investigate whether the widely used
R*-tree is suited to solving this problem. For this purpose, it is compared with
two representative classification algorithms in a benchmark test. The simulation
environment is static, i.e., no filter updates take place. The majority of the
proposed classification algorithms do not support incremental updates in an
efficient manner. If the R*-tree proves suitable, this benchmark is a stepping
stone for an investigation of the dynamic case.
If several filters apply to a packet to be forwarded, a so-called tiebreaker is
used to determine the best matching filter. Common tiebreakers are (i) choose the
"first matching" filter, (ii) the filter with the highest priority, and (iii) the
"most specific" filter. It has been observed that not every policy can be enforced
by assigning priorities, and it has been proposed to apply the "most specific
filter" tiebreaker in those cases. However, this tiebreaker is only feasible if
the most specific filter is well-defined for every packet. If this is not the
case, the filter set is said to be conflicting.
In the last part of this dissertation we propose a conflict detection and
resolution algorithm for the static one-dimensional case, where each filter is
specified by an arbitrary interval. Furthermore, we show that if a partially
persistent data structure is used to solve this problem, this structure also
supports IP lookup.
Abstract
The major task of an Internet router is to forward packets towards their final
destination. When a router receives a packet from an input link interface, it uses
its destination address to look up a routing database. The result of the lookup
provides the next hop address to which the packet is forwarded. Routers only need
to determine the next best hop toward a destination, not the complete path to
the destination. Changes in network topologies due to physical link failures, link
repairs or the addition of new routers and links lead to updates in the routing
database. Since the performance of the lookup device plays a crucial role in the
overall performance of the Internet, it is important that lookup and route update
operations are performed as fast as possible. To accelerate lookup and update
operations, routing tables must be implemented in a way that they can be queried
and modified concurrently by several processes. Relaxed balancing has become a
commonly used concept in the design of concurrent search tree algorithms.
The first part investigates the hypothesis that a relaxed balancing scheme is better
suited for search-tree based dynamic IP router tables than a scheme that utilizes
strict balancing. To this end we propose the relaxed balanced min-augmented
range tree and benchmark it with the strictly balanced variant using real IPv4
routing data. Further, in order to support a plausibility argument that corroborates the correctness of the proposed locking schemes, we present an interactive visualization of the relaxed balanced min-augmented range tree.
Enhanced IP routers further provide policy-based routing (PBR) mechanisms,
complementing the existing destination-based routing scheme. PBR provides a
mechanism for implementing Quality of Service (QoS), i.e., certain kinds of traffic
receive differentiated, preferential service. For example, time-sensitive traffic such
as voice should receive higher QoS guarantees than less time-sensitive traffic such
as file transfers or e-mail. Besides QoS, PBR further provides a mechanism to enforce network security policies. PBR requires network routers to examine multiple
fields of the packet header in order to classify packets into “flows”. Flow identification entails searching a table of predefined filters to identify the appropriate flow
based on criteria including source and destination IP address, ports, and protocol type. Geometrically speaking, classifying an arriving packet is equivalent to
finding the best matching hyperrectangle among all hyperrectangles that contain
the point representing the packet.
The R-tree is a multidimensional index structure for the dynamic management of
spatial data. The R-tree and its variants have not yet been experimentally evaluated and benchmarked for their suitability for the packet classification problem.
In the second part we investigate whether the popular R*-tree is suited for packet
classification. For this purpose we benchmark the R*-tree against two representative classification algorithms in a static environment. Most of the proposed
classification algorithms do not support fast incremental updates. If the R*-tree
proves suitable in a static classification scenario, then the benchmark is a
stepping stone for benchmarking R*-trees in a dynamic classification environment,
i.e., one where classification is intermixed with filter updates.
If a packet matches multiple filters, a tiebreaker is used in order to determine
the best matching filter among all matching filters. Common tiebreakers are: (i)
First matching filter, (ii) highest priority filter (HPF) and (iii) most specific filter
(MSTB). However, not every policy can be enforced by assigning priorities. In
these cases, MSTB should be used instead. Yet, the most specific tiebreaker is only
feasible if for each packet p the most specific filter that applies to p is well-defined.
If this is not the case, the filter set is said to be conflicting.
In the last part of the thesis we propose a conflict detection and resolution algorithm for static one-dimensional range tables, i.e., where each filter is specified by
an arbitrary range. Further, we show that by making use of partial persistence,
the structure can also be employed for IP lookup.
Acknowledgments
The work presented in this thesis was carried out during my time as a research
assistant at the institute of computer science at the University of Freiburg.
Inexpressible thanks go to my research advisor, Prof. Dr. Thomas Ottmann. I
sincerely appreciate his tremendous patience, especially early in my studies when
I was a rather naïve researcher, his time for countless discussions, and his diligent
mentorship. His keen sense for untraveled paths in the world of research is truly
inspiring.
It has also been an honor to have Prof. Dr. Susanne Albers as my second referee.
I highly appreciate her time and effort to appraise my thesis.
I also would like to thank Prof. Dr. Christian Schindelhauer and Prof. Dr. Wolfram Burgard for serving on my committee.
I gratefully acknowledge the support of my work through a grant from the German
Research Foundation (DFG) within the program “Algorithmik großer und komplexer Netzwerke”.
I would like to thank my current and former collaborators in our research group:
Frank Dal-Ri, Dr. Tobias Lauer, Khaireel Mohamed, Robin Pomplun, Christoph
Hermann, Martina Welte, Dr. Wolfgang Hürst, Dr. Peter Leven and Elisabeth
Patschke, for all their advice and for the enjoyable time.
I have also had the opportunity to advise or co-advise several bachelor and diploma
theses. Specifically, I would like to thank Thorsten Seddig and Waldemar Wittmann,
whose support significantly enhanced my research. Further, I am particularly
grateful to my friend and post-graduate assistant Bettina Bär whose effort and
dedication greatly promoted a substantial part of this work.
I would like to thank my friend Anita Willmann for her friendship since childhood
and for accompanying me to faraway places for special events (“I booked a seat
right next to yours”). I also thank her and Dr. Tobias Lauer for proofreading
parts of this thesis.
I would like to offer my most heartfelt thanks to my husband and best friend, Ingo
Daniel Maindorfer, for his love, companionship and undying support. Life would
not be what it is without him. I would like to thank my parents for their love
and for their consummate support and encouragement throughout my life. It is
impossible to list all that they have done and still do for me. I thank my sister
Arlette Patricia for the joy we have shared since she entered this planet. I love you all
and I will never stop thanking God for you.
Contents

1 Introduction . . . 1
  1.1 Geometric interpretation of IP lookup and packet classification . . . 5
  1.2 Objectives of this dissertation . . . 7
  1.3 Organization . . . 9

I IP Address Lookup . . . 11

2 Introduction . . . 13
  2.1 Organization of part I . . . 14
  2.2 Another geometric interpretation of IP lookup . . . 14
  2.3 Related work . . . 15

3 Min-augmented Range Trees . . . 23
  3.1 Longest matching prefix . . . 23
  3.2 Update operations . . . 24
  3.3 Comparison with priority search trees and priority search pennants . . . 25

4 Relaxed Balancing . . . 27
  4.1 Red-black trees . . . 27
    4.1.1 Insertions . . . 28
    4.1.2 Deletions . . . 28
  4.2 Relaxed balanced red-black trees . . . 29
    4.2.1 Interleaving updates . . . 30
    4.2.2 Concurrent handling of rebalancing transformations . . . 31

5 Relaxed Min-Augmented Range Trees . . . 33
  5.1 Longest matching prefix . . . 34
  5.2 Interleaving updates . . . 35

6 Concurrency Control . . . 37
  6.1 The deadlock problem . . . 40
  6.2 Strictly balanced trees . . . 41
    6.2.1 Concurrent MART . . . 42
  6.3 Relaxed balanced trees . . . 46
    6.3.1 Concurrent RMART . . . 48

7 Interactive Visualization of the RMART . . . 51
  7.1 Application framework . . . 52
  7.2 Architecture . . . 52
  7.3 The graphical user interface . . . 53

8 Experimental Results . . . 61
  8.1 The MRT format . . . 62
  8.2 Flow characteristics in internetwork traffic . . . 63
    8.2.1 Locality in internetwork traffic . . . 64
    8.2.2 Statistical properties of flows . . . 64
  8.3 Generation of sequences of operations . . . 65
  8.4 Test setup . . . 67
  8.5 Comparison of the RMART and the MART . . . 68
    8.5.1 Solely lookups . . . 69
    8.5.2 Solely updates . . . 71
    8.5.3 Various update frequencies . . . 72
    8.5.4 Résumé of experimental results . . . 74
  8.6 Benchmark on Sun Fire X4600 . . . 76
  8.7 Implementing the RMART in hardware . . . 78

9 Conclusions and Future Directions . . . 81

II Packet Classification . . . 83

10 Introduction . . . 85
  10.1 Goal of this part . . . 85
  10.2 Organization of part II . . . 86
  10.3 Related work . . . 86

11 R-trees . . . 95
  11.1 The original R-tree . . . 95
    11.1.1 Query processing . . . 96
    11.1.2 Query optimization criteria . . . 97
    11.1.3 Updates . . . 98
  11.2 R-tree variants . . . 98
    11.2.1 The R+-tree . . . 99
    11.2.2 The R*-tree . . . 99
    11.2.3 Compact R-trees . . . 99
    11.2.4 cR-trees . . . 99
    11.2.5 Static versions of R-trees . . . 100

12 Packet Classification using R-trees . . . 101
  12.1 Performance evaluation . . . 101
    12.1.1 Filter sets . . . 102
    12.1.2 Simulation results of R*-tree . . . 103
    12.1.3 Benchmark of R*-tree and HyperCuts . . . 105
    12.1.4 Benchmark of R*-tree and RFC . . . 111
  12.2 Conclusions and future directions . . . 113

III Conflict Detection and Resolution . . . 115

13 Introduction . . . 117
  13.1 Organization of this part . . . 118
  13.2 Preliminaries . . . 118
  13.3 Related work . . . 123
    13.3.1 Online conflict detection and resolution . . . 123
    13.3.2 Offline conflict detection and resolution . . . 126

14 Detecting and Resolving Conflicts . . . 129
  14.1 The output-sensitive solution to the one-dimensional offline problem . . . 129
    14.1.1 Status structures . . . 130
    14.1.2 Handling event points . . . 131
    14.1.3 The sweepline environment . . . 132
    14.1.4 Running Slab-Detect . . . 132
  14.2 Experimental results . . . 133
  14.3 Adapting Slab-Detect under the HPF rule . . . 135
    14.3.1 Status structures . . . 137
    14.3.2 Handling event points . . . 137
  14.4 Setting up IP lookup with Slab-Detect . . . 137
  14.5 Contributions and concluding remarks . . . 139

IV Summary of Contributions . . . 141

Bibliography . . . 145
Chapter 1
Introduction
The Internet is a global web of autonomous networks, a “network of networks”,
interconnected with routers. Each network, or Autonomous System (AS), is managed by its own authority and contains its own internal network of routers and
subnetworks. Network “reachability” information is exchanged via routing protocols. A dynamic routing protocol adjusts to changing network topologies, which
are indicated in update messages that are exchanged between routers. If a link goes
down or becomes congested, the routing protocol makes sure that other routers
know about the change. From these updates a router constructs a forwarding table which contains a set of network addresses and a reference to the interface that
leads to that network. Routers in different autonomous systems use the Border
Gateway Protocol (BGP) to exchange network reachability information. Routing
among autonomous systems is called exterior routing or “interdomain routing”.
After applying local policies, a BGP router selects a single best route and advertises it to other routers within the same AS. Interior routing is referred to as
“intradomain routing”. The primary interior routing protocol in use today is Open
Shortest Path First (OSPF).
Information travels in packets across a network that consists of multiple paths to
a destination. A packet is conceptually divided into two pieces: the header and
the payload. The header contains addressing and control fields, while the payload
carries the actual data to be sent over the internetwork. When a packet arrives
at a router, the router consults its forwarding table to determine the best way to
forward that packet, i.e., the next hop address. However, routers only need to
determine the next best hop toward a destination, not the complete path to the
destination.
The TCP/IP protocol suite provides the internetwork addressing scheme and transport scheme for router-connected networks [1]. In order to uniquely identify Internet hosts, each host is assigned an IP address. An IP address is a unique number
that contains two parts: a network address and a host address. The network
address is used when forwarding packets across interconnected networks. It defines the destination network, and routers along the way know how to forward the
packet based on the network address. When the packet arrives at the destination network, the host portion of the IP address identifies the destination host.
Currently, the vast majority of Internet traffic utilizes Internet Protocol version 4
(IPv4). IPv4 assigns 32-bit addresses to Internet hosts, which limits the address
space to 2^32 possible unique addresses. With the rapid growth of the Internet
through the 1990s, there was a rapid reduction in the number of free IP addresses
available under IPv4 [2]. The IETF settled on IPv6, recommended in January
1995 in RFC 1752, sometimes also referred to as the “Next Generation Internet
Protocol”, or IPng [2]. IPv6 assigns 128-bit addresses to Internet hosts. The date
predicted where the Regional Internet Registry IPv4 unallocated address pool will
be exhausted is November 2011 [3]. A related prediction is the exhaustion of the
Internet Assigned Numbers Authority IPv4 unallocated address pool by the end of
2010 [3]. Currently, we are in a transition phase, i.e., IPv4 and IPv6 coexist on the
same machines (technically often referred to as “dual stack”) and are transmitted
over the same network links.
While computers work with IP addresses as 32-bit (respectively 128-bit) binary values, humans
normally use the dotted-decimal notation. A binary IPv4 address and its dotted-decimal equivalent are, e.g., 11000000.10101000.00001010.00000110 = 192.168.10.6.
Note that the 32-bit address is divided into four eight-bit fields called octets. Each
octet in an IP address ranges in value from a minimum of 0 to a maximum of 255.
Therefore, the full range of IP addresses is from 0.0.0.0 through 255.255.255.255.
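The octet interpretation above can be sketched in a few lines of code. This is an illustrative sketch, not part of the thesis; the helper names `dotted_to_int` and `int_to_dotted` are assumptions.

```python
# Sketch: converting between dotted-decimal notation and the
# underlying 32-bit integer value of an IPv4 address.

def dotted_to_int(addr: str) -> int:
    """Interpret four octets as one 32-bit big-endian integer."""
    octets = [int(o) for o in addr.split(".")]
    assert len(octets) == 4 and all(0 <= o <= 255 for o in octets)
    value = 0
    for o in octets:
        value = (value << 8) | o
    return value

def int_to_dotted(value: int) -> str:
    """Split a 32-bit integer back into four octets."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))

# The example from the text: 11000000.10101000.00001010.00000110
assert dotted_to_int("192.168.10.6") == 0b11000000101010000000101000000110
assert int_to_dotted(0b11000000101010000000101000000110) == "192.168.10.6"
```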
Historically, the IP address space was divided into three main classes, where each
class had a fixed size network address: Class A (16777214 hosts), Class B (65534
hosts), and Class C (254 hosts) [1]. The class was determined by the most significant bits of an IP address. Most organizations which required a larger address
space than Class C were allocated a block of Class B addresses, even though their
network consumed only a fraction of the addresses. During the 1980s, the need for
more flexible addressing schemes became increasingly apparent. This led to the
gradual development of subnetting and Classless Inter-Domain Routing (CIDR).
CIDR was introduced in 1993 and is the latest refinement to the way IP addresses
are interpreted [4]. CIDR allows routing protocols to aggregate network addresses
into single routing table entries which reduces the amount of packet forwarding
information stored by each router. These aggregations, commonly called CIDR
blocks, share an initial sequence of bits in the binary representation of their IP
addresses. IPv4 CIDR blocks are identified using a syntax similar to that of IPv4
addresses: a four-part dotted-decimal address, followed by a slash, then a number from 0 to 32: A.B.C.D/k. The dotted decimal portion is interpreted, like an
IPv4 address, as a 32-bit binary number that has been broken into four octets.
The number following the slash is the prefix length, the number of shared initial
bits, i.e., counting from the most significant bit. For example, in the CIDR block
206.13.01.48/25, the “/25” indicates the first 25 bits are used to identify the unique
network, leaving the remaining bits, which are commonly represented by a wildcard
’*’, to identify the specific host. An IP address is part of a CIDR block, and is
said to match the CIDR prefix if the initial k bits of the address and the CIDR
prefix are the same.
The task of resolving the next hop for an incoming packet is referred to as IP
lookup. A route lookup requires finding the longest matching prefix among all
matching prefixes for the given destination address. An example of Longest Prefix
Matching (LPM) for a 7-bit search key is provided in Figure 1.1.

Figure 1.1: Example of Longest Prefix Matching for a 7-bit destination address;
11011* is the longest matching prefix; the corresponding next hop is seven.
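The LPM operation can be sketched as a linear scan, which is a naive baseline rather than any of the data structures studied in this thesis. The forwarding table below is a hypothetical 7-bit example in the spirit of Figure 1.1; the entry contents are assumptions for illustration.

```python
# Sketch of Longest Prefix Matching by linear scan.
# Table entries: (prefix bits, prefix length, next hop).

def matches(addr: int, prefix: int, length: int, w: int) -> bool:
    """True if the first `length` bits of `addr` equal the prefix bits."""
    if length == 0:
        return True                # the default route matches everything
    shift = w - length
    return (addr >> shift) == (prefix >> shift)

def longest_prefix_match(addr: int, table, w: int):
    """Return the next hop of the longest matching prefix, or None."""
    best = None
    for prefix, length, next_hop in table:
        if matches(addr, prefix, length, w) and (best is None or length > best[0]):
            best = (length, next_hop)
    return best[1] if best is not None else None

table = [
    (0b1000000, 1, 3),   # 1*      -> next hop 3
    (0b1101000, 4, 5),   # 1101*   -> next hop 5
    (0b1101100, 5, 7),   # 11011*  -> next hop 7
]
# 11011* is the longest prefix matching the 7-bit address 1101101.
assert longest_prefix_match(0b1101101, table, 7) == 7
```

A real router table holds hundreds of thousands of prefixes, which is why the later chapters replace this linear scan with tree-based structures.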
The Transmission Control Protocol provides a reliable transmission service for IP
packets [1]. While TCP provides these reliable services, it depends on IP to deliver packets. Reliable data delivery services are critical for applications such as file
transfers, database services, transaction processing, and other mission-critical applications in which guaranteed delivery of every packet is required. TCP uses sequence
numbers so that the destination can reorder packets and determine if a packet is
missing. It further uses a cumulative acknowledgment scheme, where the receiver
sends an acknowledgment signifying that it has received all data preceding the
acknowledged sequence number. Sequence numbers and acknowledgments make it
possible for TCP to provide an in-order delivery of packets at the destination host,
discard duplicate packets, and retransmit lost packets. TCP identifies applications
using 16-bit port numbers carried in the transport header which is appended to
the IP header. The type of transport protocol carried in the IP header determines
the format of the transport protocol header following the IP header in the packet.
Best-effort delivery describes a network service in which the network does not provide any guarantees that data is delivered or that a user is given a guaranteed quality of service level or a certain priority [1]. In a best-effort network all users obtain
best-effort service, meaning that they obtain unspecified variable bit rate and delivery time, depending on the current traffic load. Note that TCP does not reserve
any resources in advance, and does not provide any guarantees regarding quality of
service, for example bit rate. In that sense, it can be considered as best-effort communication. Conventional IP routers only provide best-effort service. Enhanced IP
routers further provide policy-based routing (PBR) mechanisms, complementing
the existing destination-based routing scheme [5]. PBR provides a mechanism for
expressing and implementing routing of data packets based on the policies defined
by the network administrators. For example, mission-critical and time-sensitive
traffic such as voice should receive higher quality of service (QoS) guarantees than
less time-sensitive traffic such as file transfers or e-mail. Besides QoS, PBR further
provides a mechanism to enforce network security policies.
PBR requires network routers to examine multiple fields of the packet header in
order to categorize packets into “flows”. A flow may be thought of as the communication traffic generated by a specific application traveling between a specific set
of hosts or subnetworks. Hence, flows are considered to be sequences of packets
with an n-tuple of common values such as source and destination addresses. The
process of categorizing packets into flows in an Internet router is called packet
classification. The function of the packet classification system is to check packet
headers against a set of predefined filters. The relevant packet header fields include
source and destination IP addresses, source and destination port numbers, protocol, and others. Formally, a filter set consists of a finite set of n filters f1, f2, . . . , fn.
Each filter is a combination of d header field specifications h1, h2, . . . , hd. Each
header field specifies one of four kinds of matches: exact match, prefix match,
range match, or masked-bitmap match. A packet p is said to match a filter fi if
and only if the header fields h1, h2, . . . , hd match the corresponding fields in fi in
the specified way. Each filter fi has an associated action that determines how a
packet p is handled if p matches fi.
A collection of filters is called a classifier. An example classifier is shown in Table
1.1.
The header of an arriving packet may satisfy the conditions of more than one
filter. In this case the filter with the highest priority among all the matching filters
is commonly used. Using the example classifier in Table 1.1, an incoming packet
p with header (10 . . . , 0011 . . . , TCP, 1) matches f2 and f3. Assuming that f2 has
higher priority than f3, f2 will be returned.
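A classifier lookup of this kind can be sketched as follows. The representation (prefixes and addresses as bit strings, priorities resolved by the smallest tag P) and the fully specified test packets are illustrative assumptions; the filters are those of Table 1.1.

```python
# Sketch: matching a packet against the classifier of Table 1.1 and
# breaking ties by the priority tag P (smaller value = higher priority).

def prefix_matches(bits: str, prefix: str) -> bool:
    """`prefix` is written as in Table 1.1, e.g. '1011*' or '*'."""
    return bits.startswith(prefix.rstrip("*"))

# (SA prefix, DA prefix, protocol, destination-port range, priority P)
filters = [
    ("11*",     "*",       "TCP", (3, 15), 1),   # f1
    ("100111*", "*",       "TCP", (1, 1),  2),   # f2
    ("1011*",   "0011*",   "*",   (1, 15), 3),   # f3
    ("10*",     "011*",    "UDP", (3, 3),  4),   # f4
    ("0*",      "11*",     "TCP", (0, 1),  5),   # f5
    ("0*",      "100111*", "UDP", (0, 15), 6),   # f6
    ("*",       "*",       "TCP", (3, 5),  7),   # f7
]

def classify(sa: str, da: str, proto: str, dport: int):
    """Return the priority of the best matching filter, or None."""
    matching = [
        p for (sp, dp, pr, (lo, hi), p) in filters
        if prefix_matches(sa, sp) and prefix_matches(da, dp)
        and pr in ("*", proto) and lo <= dport <= hi
    ]
    return min(matching) if matching else None

# A hypothetical packet that matches both f1 and f7; f1 wins on priority.
assert classify("1100", "0000", "TCP", 3) == 1
```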
Filter   SA        DA        Prot   DP       P
f1       11*       *         TCP    [3:15]   1
f2       100111*   *         TCP    [1:1]    2
f3       1011*     0011*     *      [1:15]   3
f4       10*       011*      UDP    [3:3]    4
f5       0*        11*       TCP    [0:1]    5
f6       0*        100111*   UDP    [0:15]   6
f7       *         *         TCP    [3:5]    7

Table 1.1: Example classifier of seven filters classifying on four fields (source and
destination address, protocol and destination port). Each filter has an associated
priority tag P; wildcard fields are denoted with *.
Figure 1.2: The longest matching prefix corresponds to the most specific interval
of all intervals that contain the query point.
1.1 Geometric interpretation of IP lookup and packet classification
In geometric terms, a prefix b1 . . . bk* can be mapped to the interval
[b1 . . . bk 0 . . . 0, b1 . . . bk 1 . . . 1]. For example, if the prefix length is limited to
5, 0010* is represented by [4, 5]. An incoming packet with destination address
b1 . . . bw can be mapped to a point p ∈ U, where U = [0, 2^w − 1], with w = 32 for
IPv4 and w = 128 for IPv6.
The longest matching prefix corresponds to the most specific interval of all intervals
that contain the query point. An interval f1 is more specific than an interval f2 iff
f1 ⊂ f2 . If two intervals partially overlap, neither is more specific than the other.
Figure 1.2 shows an example.
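The interval mapping and the most-specific-interval rule can be sketched directly. The helper names are illustrative, not taken from the thesis.

```python
# Sketch: prefix-to-interval mapping and the "most specific interval"
# rule for a w-bit universe.

def prefix_to_interval(prefix: str, w: int):
    """Map a prefix such as '0010*' to [low, high] in a w-bit universe."""
    bits = prefix.rstrip("*")
    low = int(bits + "0" * (w - len(bits)), 2)
    high = int(bits + "1" * (w - len(bits)), 2)
    return (low, high)

# The example from the text: with prefix length limited to 5,
# 0010* is represented by [4, 5].
assert prefix_to_interval("0010*", 5) == (4, 5)

def most_specific(point: int, intervals):
    """Of the intervals containing `point`, return the shortest (most
    specific) one; for prefix-derived intervals this is well-defined
    whenever the default interval spanning U is present."""
    containing = [(hi - lo, (lo, hi))
                  for (lo, hi) in intervals if lo <= point <= hi]
    return min(containing)[1] if containing else None

intervals = [prefix_to_interval(p, 5) for p in ("*", "0*", "0010*")]
assert most_specific(5, intervals) == (4, 5)
```

Because prefix-derived intervals are nested or disjoint, the shortest containing interval is also contained in every other containing interval, so this rule coincides with longest prefix matching.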
A set of intervals specified by prefixes has the property that any two intervals are
either disjoint or one is completely contained in the other. Hence, for each query
point p, there is a uniquely defined most specific interval that contains p, provided
that the default filter spanning the entire universe U is included in the set.
We have seen that a prefix represents a contiguous interval on the number line.
Similarly, a two-dimensional filter is represented by an axes-parallel rectangle in
the two-dimensional Euclidean space. A filter f = (prs*, prd*), where prs is an i-bit
prefix and prd is a j-bit prefix, is represented by a 2^(w−i) × 2^(w−j) rectangle, where
w is the maximum prefix length. Generalizing, a filter in d dimensions represents
a d-dimensional hyperrectangle in d-dimensional space. A classifier is therefore
a collection of rectangles, each of which is labeled with a priority. An incoming
packet header represents a point with coordinates equal to the values of the header
fields corresponding to the dimensions. For example, Figure 1.3 shows the geometric representation of the classifier in Table 1.1 for the source and destination
address fields and w = 10. Filter f7 covers the entire space 2^10 × 2^10.
Given this geometric representation, classifying an arriving packet is equivalent to
finding the highest priority rectangle among all rectangles that contain the point
representing the packet. For example, the point p in Figure 1.3 is contained in
the filters with priorities five and seven. If lower values represent higher priorities,
then filter f5 will be returned.
Figure 1.3: The geometric representation of the 10-bit source and destination
address fields of the classifier in Table 1.1. Point p represents a packet to be
classified.
1.2 Objectives of this dissertation
With a rapid increase in data transmission link rates and immense continuous growth in Internet traffic, efficient lookup and classification techniques
are essential for meeting performance demands. The speed and scalability of the
IP lookup or packet classification scheme employed largely determines the performance of the router, and hence the Internet as a whole. Therefore, both problems
have received much attention in the research community.
Due to the transient nature of network links, routing protocols allow the routers
to continually exchange information about the state of the network. There are
two strategies to handle table updates. The first employs two copies of the table.
Lookups are done on the working table, updates are performed on a shadow table.
Periodically, the shadow table replaces the working table. In this mode of operation, packets may be forwarded incorrectly. The number of misdirections depends on
the periodicity with which the working table is replaced by an updated shadow.
Further, additional memory is required for the shadow table. The second strategy
performs updates directly on the working table. Here, no packet is improperly forwarded. However, IP lookup may be delayed while a preceding update completes.
To accelerate lookup and update processes operating on a single forwarding table,
these tables must be implemented in a way that they can be queried and modified
concurrently by several processes. If implemented in a concurrent environment,
there must be a way to prevent simultaneous reading and writing of the same
parts of the data structure. A common strategy is to lock the critical parts. In
order to allow a high degree of concurrency, only a small part of the structure
should be locked at a time. Relaxed balancing has become a commonly used concept in the design of concurrent search tree algorithms. In relaxed balanced data
structures, rebalancing is uncoupled from updates and may be arbitrarily delayed.
This contrasts with strict balancing, where rebalancing is performed immediately
after an update. Hanke [6] presents an experimental comparison of the strictly
balanced red-black tree and three relaxed balancing algorithms for red-black trees,
using the simulation of a multiprocessor machine. The results indicate that the
relaxed schemes have significantly better performance than the strictly balanced
version.
Motivated by Hanke’s results, the first part investigates the hypothesis that a relaxed balancing scheme is better suited for search-tree based dynamic IP router
tables than a scheme that utilizes strict balancing. To this end, we propose the
relaxed balanced min-augmented range tree and benchmark it with the strictly
balanced version of the tree using real IPv4 routing data. In order to carry out a
plausibility consideration, which corroborates the correctness of the proposed locking schemes, we will present an interactive visualization of the relaxed balanced
min-augmented range tree.
The R-tree, one of the most influential multidimensional access methods, was
proposed by Guttman in 1984. R-tree applications cover a wide spectrum, from
geographical information systems, computer-aided design to computer vision and
robotics. R-trees are hierarchical data structures that are used for the dynamic
organization of a set of d-dimensional geometric objects. The challenge for R-trees
is the following: dynamically maintain the structure in a way that retrieval operations are supported efficiently. Common retrieval operations are range queries,
i.e., find all objects that a query region intersects, or point queries, i.e., find all
objects that contain a query point. The R-tree and its variants have not been
experimentally evaluated and benchmarked for their eligibility for the packet classification problem.
In the second part we will investigate if the popular R*-tree is suited for packet
classification in a static environment. To this end we will benchmark the R*-tree
with two representative classification algorithms using the ClassBench tools suite.
Most of the proposed classification algorithms do not support fast incremental updates. If the R*-tree proves suitable in a static classification scenario, then
the benchmark is a stepping stone for benchmarking R*-trees in a dynamic classification environment, i.e., where classification is intermixed with filter updates.
We have seen that filters can lead to ambiguities in the packet classification process. This is due to the fact that packets might match multiple filters, each with
a different associated action. Hari et al. [7] noticed that not every policy can be
enforced by assigning priorities and applying the filter with the highest priority.
The authors suggest a scheme that utilizes the most specific tiebreaker (MSTB),
analogous to the most specific tiebreaker in one-dimensional IP lookup. If the
most specific tiebreaker is to be applied, it must be ensured that for each packet
p there is a well-defined most specific filter that applies to p. In one-dimensional
prefix tables, any two filters are either disjoint or one is completely contained in
the other. Therefore, for an incoming packet p the most specific filter that matches
p is well defined. In higher dimensions, filters may partially overlap. Hence, for
points falling in the overlap region, the most specific filter may not be defined.
Hari et al.’s seminal technique adds so-called “resolve filters” for each pair of partially overlapping filters, which guarantees that the most specific tiebreaker can be
applied.
The third part of this dissertation proposes a conflict detection and resolution algorithm for static one-dimensional range tables containing arbitrary ranges. We
are motivated to study the one-dimensional case for the following reason. Multidimensional classifiers typically have one or more fields that are arbitrary ranges.
Since a solution for multidimensional conflict detection often builds on data structures for the one-dimensional case, it is beneficial to develop efficient solutions for
one-dimensional range router tables.
1.3 Organization
The remainder of this dissertation is organized as follows. Each of the three objectives is presented in a separate part. These parts can be read independently
of each other. Each part has an introduction, surveys related work, presents the
contributions and terminates with a summary and future directions. Finally, the
dissertation concludes with an overall summary of contributions.
Part I
IP Address Lookup
Chapter 2
Introduction
The Internet is a system of immense scale. Changes in network topologies due to
physical link failures, link repairs or the addition of new routers and links happen
quite frequently as indicated by high volumes of routing updates [8]. This information must be flooded to all routers in the network as soon as possible after the
event. Routers must then update their routing tables accordingly. Min-augmented
range trees (MART) were introduced by Datta and Ottmann as a conceptually simple tree structure for maintaining dynamic IP router tables [9]. When the
forwarding table is maintained in a min-augmented range tree, the complexity of
IP lookup is O(h), where h is the height of the tree. Hence it is desirable to maintain the
underlying search tree balanced.
In order to accelerate the lookup and update operations, min-augmented range
trees must be implemented in a way that they can be queried and modified concurrently by several processes. Trees with relaxed balance are defined to facilitate
fast updating in a concurrent database environment, since the rebalancing tasks
can be performed gradually after urgent updates. Weaker constraints
than the usual ones are maintained such that the tree can still be rebalanced efficiently.
Uncoupling was first discussed in connection with red-black trees [10], and later
in connection with AVL trees [11]. Since then, several relaxed balancing schemes
have been proposed [12] [13] [14] [15]. Relaxed data structures with group updates
have been proposed in [16] [17] [18].
In this part we propose the relaxed balanced min-augmented range tree and investigate the hypothesis that the relaxed balanced min-augmented range tree is
better suited for the representation of dynamic IP router tables than the strictly
balanced version of the tree. To this end, we benchmark these two structures using real IPv4 routing data. To our knowledge, there are no other approaches which
examine relaxed balancing in the context of forwarding or packet classification in
general.
The research and implementation in this part was carried out in collaboration with
Thorsten Seddig, Bettina Bär, Tobias Lauer and Thomas Ottmann. Thorsten Seddig has implemented the RMART within the scope of his diploma thesis [19]. The
MART has been implemented by my colleague Tobias Lauer. Bettina Bär has given
support in the implementation and realization of the benchmark of the concurrent
MART and the RMART. A visualization of the RMART was developed in collaboration with Waldemar Wittmann in line with his bachelor thesis [20].
2.1 Organization of part I
The remainder of this part is organized as follows. Section 2.2 describes another
geometric interpretation of IP lookup, while section 2.3 reviews related work. The
min-augmented range tree for the representation of dynamic forwarding tables is
presented in chapter 3. Relaxed balanced red-black trees, which form the basis
of the relaxed balanced min-augmented range tree, are presented in chapter 4.
The locking strategies for both the strictly and relaxed balanced min-augmented
range trees are described in chapter 6. An interactive animation of the RMART
is presented in chapter 7. Finally, benchmark results are discussed.
2.2 Another geometric interpretation of IP lookup
As we have seen in section 1.1, the longest prefix match problem can be mapped
into the geometric problem of finding the shortest interval on a line containing a
query point. Intervals on the line can be mapped to points in the plane and vice
versa, because both entities are defined by two values. If we map an interval [l, r]
with start point l and finish point r to the point (r, l) in the plane, a set of intervals
on the line is mapped to a set of points below the main diagonal in the plane. Let
us denote this mapping by map1, following the notation by Lu and Sahni [21].
A point p is said to stab an interval [l, r] if p ∈ [l, r]. A stabbing query reports
all intervals that are stabbed by a given query point. It has been observed that
stabbing queries for sets of intervals on the line can be translated to range queries
for so called south-grounded, semi-infinite ranges of points in the plane [22]. More
precisely: p ∈ [l, r] iff map1(l, r) is to the right of and below the point (p, p).
For two intervals [l, r] and [l′, r′] the point map1(l, r) lies to the left of and above the
point map1(l′, r′) iff [l, r] is contained in [l′, r′]. Hence, finding the most specific
interval containing a given point p corresponds to finding, for the point p = (p, p)
on the main diagonal, the topmost and leftmost point (r, l) that lies to the right of
and below p, cf. Figure 2.1. Note that if the most specific interval exists, there is always
a unique topmost-leftmost point.
Thus, solving the dynamic version of the IP lookup problem for prefix filters means
to maintain a set of points in the plane for which we can carry out insertions
and deletions of points and answer topmost-leftmost queries efficiently.

Figure 2.1: A set of intervals mapped to a set of points. The longest matching
prefix is the topmost-leftmost point below the query point p = (p, p).

Topmost-leftmost queries can be reduced to leftmost queries when ensuring that no two
points have the same x-coordinate. This can be accomplished by mapping each
point (x, y) to the point (2^w · x − y + 2^w − 1, y) [21].
Note that it is not possible to have a leftmost point which is not also the highest
point in the semi-infinite range to the right and below a query point on the main
diagonal. This is due to the fact that intervals that are specified by prefixes
have the property that any two intervals are either disjoint or one is completely
contained in the other.
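The mapping and a brute-force form of the topmost-leftmost query can be sketched as follows; the interval set is a made-up example, and the balanced search structures discussed later replace the linear scan:

```python
# Sketch of map1 and a brute-force topmost-leftmost query.
def map1(l, r):
    return (r, l)   # interval [l, r] -> point (r, l) below the main diagonal

def topmost_leftmost(points, p):
    """Among points (x, y) with x >= p and y <= p, return the leftmost,
    and among equal x the topmost -- i.e. the most specific interval."""
    candidates = [(x, y) for (x, y) in points if x >= p and y <= p]
    if not candidates:
        return None
    return min(candidates, key=lambda q: (q[0], -q[1]))

intervals = [(0, 255), (64, 127), (96, 111)]   # nested prefix intervals
points = [map1(l, r) for (l, r) in intervals]
x, y = topmost_leftmost(points, 100)
print((y, x))   # (96, 111): the most specific interval containing 100
```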
2.3 Related work
Longest Prefix Matching has received significant attention due to the fundamental
role it plays in the performance of Internet routers. If the set of prefixes is small,
a linear search through a list of the prefixes sorted in order of decreasing length
may be sufficient. The sorting step guarantees that the first matching prefix in
the list is the longest matching prefix for the given search key. Linear search is
commonly touted as the most memory efficient of all LPM techniques in that the
memory requirement is O(n), where n is the number of prefixes in the table. Note
that the search time is also O(n).
Several more sophisticated techniques have been developed to improve the speed
of address lookup. Each technique’s performance can be measured in terms of the
time required for lookup, the storage space required and the complexity of updating the filter set when a filter is added, deleted or changed.
Many solutions are based on the fundamental trie structure [23]. A trie is a binary tree with labeled branches. Each node v represents a bit-string formed by
concatenating the labels of all branches on the path from the root node to v. All
the descendants of any one node have a common prefix of the string associated
with that node, and the root is associated with the empty string. An example of a
Figure 2.2: Example of Longest Prefix Matching using a binary trie. The values
in the nodes denote the associated output link information.
binary trie constructed from the set of prefixes in Figure 1.1 is shown in Figure 2.2.
If a node is associated with a prefix, it stores the corresponding output link for
packets destined for the respective network. Tries allow finding, in a straightforward way, the longest prefix that matches a given destination address. IP lookup
is conducted by traversing the trie using the bits of the destination address of a
packet p, starting with the most significant bit. While traversing the trie, every
time we visit a node that is associated with a prefix we remember that prefix as
the longest match found so far. The last prefix encountered on the path is the
longest prefix that matches p [24]. As in the previous examples, the best matching
prefix for destination address 1101100 is 11011* and the corresponding output link
is seven. Note that the worst-case search time is now O(w), where w is the length
of the address and maximum prefix length in bits.
Update operations are also straightforward to implement in binary tries [24]. Inserting a prefix begins by doing a search. When arriving at a node with no branch
to take, we can insert the necessary nodes. Deleting a prefix starts again by a
search, unmarking the node as a prefix and, if necessary, deleting unused nodes.
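The lookup and insertion procedures just described can be sketched as follows. The prefixes reproduce the example match for destination address 1101100 (best match 11011* with output link seven); the additional entry for prefix 1* with next hop 1 is an assumed filler entry:

```python
# Sketch of a binary trie for longest prefix matching; prefixes are bit strings.
class TrieNode:
    def __init__(self):
        self.child = [None, None]   # child[0] for bit '0', child[1] for bit '1'
        self.next_hop = None        # set iff this node marks a prefix

def insert(root, prefix, next_hop):
    node = root
    for bit in prefix:
        b = int(bit)
        if node.child[b] is None:
            node.child[b] = TrieNode()
        node = node.child[b]
    node.next_hop = next_hop

def lookup(root, address):
    """Walk the address bits, remembering the last prefix node visited."""
    node, best = root, root.next_hop
    for bit in address:
        node = node.child[int(bit)]
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop
    return best

root = TrieNode()
insert(root, "1",     1)    # assumed filler entry
insert(root, "11011", 7)
print(lookup(root, "1101100"))   # 7: "11011" is the longest matching prefix
```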
Several schemes have been proposed to improve the lookup performance of binary tries, e.g., multibit tries [25] and shape shifting tries [26]. These strategies
collapse several levels of each subtree of a binary trie into a single node that can
be searched with a number of memory accesses that is less than the number of
levels collapsed.
Lu and Sahni [27] propose a method to partition a static IP router table such that
each partition is represented using a base structure such as a multibit trie [25] or
a hybrid shape shifting trie [28]. The partition results in an overall reduction in
the number of memory accesses needed for a lookup and a reduction in the total
memory required.
The fundamental issue with trie-based techniques is that performance and scalability are inherently tied to address length. With the future transition to IPv6, it
is not clear if trie-based solutions will be capable of meeting performance demands.
Several solutions utilize the geometric view of the filter set. Lee et al. [29] propose
an algorithm which is based on the segment tree. The segment tree is a well-known
data structure in computational geometry for handling intervals. The skeleton of
the segment tree is static. After the skeleton has been built over the given set of
intervals, these intervals can be stored in a dynamic fashion, that is, supporting
insertions and deletions. First, the so-called elementary intervals are computed
which will be stored in the leaves. Each node or leaf v stores the interval IntR(v)
that it represents and a set I(v) of intervals. A parent node represents the union
of the intervals of its children. The set I(v) contains the intervals [x, x′] such that
IntR(v) is included in [x, x′] and IntR(parent(v)) is not included in [x, x′]. An
interval [x, x′] is stored at a number of nodes that together cover the interval, and
these nodes are chosen as close to the root as possible. Every interval is stored
at no more than two nodes per level. An interval i is inserted as follows: from the
root node check whether i contains the interval represented by that node. If yes,
allocate i there. Otherwise, do the same check recursively for the child nodes
whose intervals overlap i. Figure 2.3 shows the segment tree storing the
intervals a = [0, 7], b = [0, 5], c = [2, 3] and d = [6, 7]. The elementary intervals
are: [0, 1], [2, 3], [4, 5], [6, 7]. The interval stored in each node v represents IntR(v).
There always exists a unique shortest segment among the segments stored at each
node. For each node of the segment tree, a pointer is maintained that points to
that shortest segment. To find the most specific range (msr) stabbed by a query
point p, the segment tree is used as a search tree for p, i.e., we search for the
elementary interval that contains p. The last segment encountered is the shortest
segment over p. Suppose we search the msr for the query point 3 among the ranges
illustrated in Figure 2.3. The last segment encountered is interval c.
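This scheme can be sketched compactly in Python, with elementary intervals of width two as in Figure 2.3 and the per-node shortest-segment pointer described above; this is an illustration of the idea, not Lee et al.'s implementation:

```python
# Sketch of the segment-tree scheme for intervals with endpoints in [0, 7].
class SegNode:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi      # IntR(v) = [lo, hi]
        self.left = self.right = None
        self.shortest = None           # shortest interval stored at this node

def build(lo, hi, leaf_width=2):
    node = SegNode(lo, hi)
    if hi - lo + 1 > leaf_width:
        mid = (lo + hi) // 2
        node.left, node.right = build(lo, mid), build(mid + 1, hi)
    return node

def insert(node, iv):
    lo, hi = iv
    if lo <= node.lo and node.hi <= hi:          # iv covers IntR(v): store here
        if node.shortest is None or hi - lo < node.shortest[1] - node.shortest[0]:
            node.shortest = iv
        return
    for child in (node.left, node.right):        # recurse into overlapping children
        if child and lo <= child.hi and child.lo <= hi:
            insert(child, iv)

def msr(node, p):
    """Most specific range stabbing p: last shortest-segment pointer on the path."""
    last = None
    while node:
        if node.shortest:
            last = node.shortest
        node = node.left if node.left and p <= node.left.hi else node.right
    return last

root = build(0, 7)
for iv in [(0, 7), (0, 5), (2, 3), (6, 7)]:      # a, b, c, d from Figure 2.3
    insert(root, iv)
print(msr(root, 3))   # (2, 3): interval c is the most specific range over 3
```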
Given n IP prefixes, their algorithm performs IP address lookup in O(h) time,
where h is the height of the tree. Their approach can also handle insertions of IP
prefixes that do not fit in the skeleton, but then the segment tree has to be rebuilt
from time to time in order to maintain lookup performance. The algorithm performs insertion in O(log n) time and deletion in O(log n) time on average.
Figure 2.3: A segment tree storing the intervals a = [0, 7], b = [0, 5], c = [2, 3] and
d = [6, 7]. The interval stored in each node v represents IntR(v).
Figure 2.4: A set of intervals R = {a = [0, 30], b = [0, 10], c = [1, 9], d = [2, 8],
e = [12, 20], f = [14, 18], g = [22, 30], h = [23, 25] and i = [27, 29]}.
In the interval tree of [30], each node v stores a non-empty subset intervals(v)
of a set of intervals R. Let the median of an ordered sample (x1 , x2 , . . . , xn ) be
defined as:
median = x_{(n+1)/2}  if n is odd,
         x_{n/2}      if n is even.
Let xmed be the median of the interval endpoints. The root stores all intervals
that contain xmed . The right subtree stores all intervals that lie completely to
the right of xmed , and the left subtree stores all intervals completely to the left
of xmed . These subtrees are constructed recursively in the same way. In [30], two
separate lists are maintained to store intervals(v). One list keeps the intervals
sorted according to increasing left endpoints, the other list maintains the intervals
sorted according to decreasing right endpoints. If intervals are nested, both lists
are identical. Consider the intervals in Figure 2.4. The interval tree storing the set
R of intervals is shown in Figure 2.5. The left endpoint of f is the median of all
the endpoints, and hence becomes the root of the tree. Intervals a, e and f contain
this endpoint and get attached to the root. Intervals b, c and d lie completely to
the left of 14 and get placed in its left subtree, g, h and i get placed in the right
subtree and so forth.
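The construction and a stabbing query can be sketched as follows, using the set R of Figure 2.4 and the lower-median convention from the definition above; only the static build is sketched here:

```python
# Sketch of the interval-tree construction and a stabbing query.
def median(xs):
    """Median per the definition above: x_{(n+1)/2} if n odd, x_{n/2} if n even."""
    xs = sorted(xs)
    n = len(xs)
    return xs[(n + 1) // 2 - 1] if n % 2 else xs[n // 2 - 1]

class ITNode:
    def __init__(self, xmed, here, left, right):
        self.xmed, self.here, self.left, self.right = xmed, here, left, right

def build(intervals):
    if not intervals:
        return None
    xmed = median([x for iv in intervals for x in iv])
    here  = [iv for iv in intervals if iv[0] <= xmed <= iv[1]]  # contain xmed
    left  = [iv for iv in intervals if iv[1] < xmed]            # fully left
    right = [iv for iv in intervals if iv[0] > xmed]            # fully right
    return ITNode(xmed, here, build(left), build(right))

def stab(node, p):
    """All intervals containing p; the shortest one is the most specific."""
    out = []
    while node:
        out += [iv for iv in node.here if iv[0] <= p <= iv[1]]
        node = node.left if p < node.xmed else node.right if p > node.xmed else None
    return out

R = [(0, 30), (0, 10), (1, 9), (2, 8), (12, 20), (14, 18),
     (22, 30), (23, 25), (27, 29)]                 # the set R of Figure 2.4
matches = stab(build(R), 19)
print(min(matches, key=lambda iv: iv[1] - iv[0]))  # (12, 20): e is most specific
```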
Figure 2.5: An interval tree storing the set R.

The longest matching prefix can be found in O(log n + k) time, where k is the
number of prefixes that match the given destination address. Suppose we search
for the longest matching prefix for the destination address 19 in the interval tree
in Figure 2.5. Intervals a and e both contain 19 and e is the most specific match.
Prefix insertion and deletion are expensive. Lu and Sahni [31] propose an enhancement of the interval tree of [30] for the representation of dynamic router
tables. The enhanced structure supports efficient insertion and deletion of ranges.
The longest matching prefix can be found in O(log n + k) time as in the original
structure. They further propose several refinements of the enhanced interval tree
for dynamic router tables. For example, LMPBOB (longest matching prefix binary
tree on binary tree) permits lookup in O(w) time, where w is the length of
the longest prefix, and filter insertion and deletion in O(log n) time each, where n
is the number of prefixes in the forwarding table.
Another scheme proposed by Lu and Sahni [21] shows that each of the three operations insert, delete and IP lookup may be performed in O(log n) time in the
worst case using a priority-search tree.
The multi-way and multi-column search techniques presented by Lampson, Srinivasan, and Varghese map the longest matching prefix problem to a binary search
over the fixed-length endpoints of the intervals defined by the prefixes [32]. The
authors exploit the fact that any two prefixes are either disjoint or nested. For
a database of n prefixes with address length w, naive binary search would take
O(w · log n). They show how to reduce this to O(w + log n) using multiple-column
binary search.
Warkhede, Suri and Varghese [33] introduce an IP lookup scheme based on a multiway range tree with worst-case search and update times of O(log n), where n is the
number of prefixes in the forwarding table.
With the advances in optical networking technology, link rates exceed 40 Gigabits per second (OC-768). Given a minimum packet size of 40 bytes, in order
to achieve 40 Gbps wire speed, the router needs to look up packets at a rate
of 125 million packets per second. This, together with other processing needs, leaves less than eight nanoseconds per packet lookup. Such high rates
demand IP lookup to be performed in hardware. Originally, commercial routers
used Content Addressable Memory (CAM) for IP address lookups in order to keep
pace with optical link speeds [34]. CAMs locate an entry by comparing the input
key against all memory words in parallel. Hence, a lookup effectively requires one
clock cycle. While binary CAMs performed well for exact match operations and
could be used for route lookups in strictly hierarchical addressing schemes, the
introduction of CIDR required storing and searching entries with arbitrary prefix
lengths [34]. In response, Ternary Content Addressable Memories (TCAMs) were
developed with the ability to store an additional “Don’t Care” state thereby enabling them to retain single clock cycle lookups for arbitrary prefix lengths [34].
The use of TCAMs for routing table lookups was first proposed by McAuley and
Francis [35]. They also described the problem of updating TCAM-based routing
tables that are sorted with respect to prefix lengths. A more recent scheme for
filter updates in TCAMs was proposed in [36]. For example, the Cisco Catalyst
6500 Series Switch maintains its Forwarding Information Base (FIB) in TCAM
which is accessed by the hardware forwarding engine ASIC (application-specific
integrated circuit) [37]. TCAMs have several deficiencies [38]: (1) high cost per
bit relative to other memory technologies, (2) storage inefficiency, (3) high power
consumption, and (4) limited scalability to long input keys. The storage inefficiency comes from two sources. First, arbitrary ranges must be converted into
prefixes. For example, if w = 4, the range [2, 10] is represented by 001∗, 01∗, 100∗
and 1010, which exactly cover that range. In the worst case, a range covering w-bit port numbers may require 2(w − 1) prefixes [38]. The second source of storage
inefficiency stems from the additional hardware required to implement the third
“Don’t Care” state. The massive parallelism inherent in TCAM architecture is
the source of high power consumption. A further deficiency stems from the lack
of flexibility and programmability [39].
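The range-to-prefix conversion behind the first source of storage inefficiency can be sketched with the standard greedy decomposition (a software illustration, not a TCAM interface); it reproduces the [2, 10] example for w = 4:

```python
def range_to_prefixes(lo, hi, w):
    """Minimal set of prefixes covering [lo, hi] over w-bit values.
    Prefixes are returned as bit strings of their fixed bits."""
    out = []
    while lo <= hi:
        # Largest aligned power-of-two block starting at lo that fits in [lo, hi]
        size = lo & -lo if lo else 1 << w
        while size > hi - lo + 1:
            size //= 2
        bits = w - size.bit_length() + 1           # number of fixed prefix bits
        out.append(format(lo >> (w - bits), '0{}b'.format(bits)) if bits else '*')
        lo += size
    return out

print(range_to_prefixes(2, 10, 4))   # ['001', '01', '100', '1010']
```

The worst case mentioned above is attained by ranges such as [1, 2^w − 2], which need 2(w − 1) prefixes.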
CoolCAMs, proposed in [40], greatly reduce power dissipation. Power
consumption is approximately proportional to the number of TCAM blocks searched. The authors of [40]
provide two different power-efficient TCAM-based architectures for IP lookup.
Both architectures utilize a two-stage lookup process. The basic idea in both cases
is to divide the TCAM device into multiple partitions. When a route lookup is
performed, the results of the first stage lookup are used to selectively search only
one of these partitions during the second stage lookup. The two architectures differ
in the mechanism for performing the first stage lookup. Zane, Narlikar and Basu
further investigate the performance of both architectures in the face of routing
table updates [40]. Adding prefixes may cause a bucket in the data TCAM to
overflow, requiring a repartitioning of the prefixes into buckets and rewriting the
entire table in the data TCAM. The authors describe several heuristics in order
to minimize the number of repartitions.
Spitznagel, Taylor, and Turner extend the basic idea of [40] and introduce Extended TCAM (E-TCAM) [41]. They propose an indexing mechanism that can
support multidimensional packet classification. E-TCAM further resolves
the range-matching inefficiency by incorporating range-matching
logic directly into hardware at the cost of a small increase in hardware resources.
Perhaps the biggest piece missing from the Extended TCAM solution is an efficient
update procedure.
The other architectural approach to IP lookup uses more conventional memory
architectures like Static RAM (SRAM) and Reduced Latency Dynamic RAM (RLDRAM) and sophisticated data structures. Trie-based structures are widely used
in these solutions, e.g., in the Juniper M-series, MX-series and T-series, ASIC-driven lookup is based on a (radix) trie [39]. In Cisco’s CRS-1 (Carrier Routing
System) high-end router, lookup is based on Tree Bitmap [42], a multibit trie algorithm, proposed by Eatherton, Varghese and Dittia [43].
Due to the serial nature of decision tree approaches, multiple clock cycles are
needed to perform IP lookup. In response, several researchers have explored
pipelining to improve the throughput [44] [45]. A pipeline is a collection of concurrent “entities” in which the output of each entity is used as the input to another.
We have seen that new solutions employ a combined algorithmic and architectural approach to the problem. In the following we will propose the relaxed min-augmented range tree (RMART), an efficient representation for dynamic IP forwarding tables, and outline a technique which can be used to describe the RMART
by a hardware description language.
Chapter 3
Min-augmented Range Trees
A min-augmented range tree (MART) [9] [46] maintaining a set of points with
pairwise different x-coordinates stores the points at the leaves such that it is a
leaf-search tree for the x-coordinates of points. The internal nodes have two fields,
a router field guiding the search to the leaves and a min field. In the router field
we store the maximum x-coordinate of the left subtree, we call this the x-value
property, and in the min field we store the minimum y-coordinate of any point
stored in the leaves of the subtrees of the node. The next section will provide an
example. In the following we show how to answer a leftmost, or minXinRectangle(xleft, ∞, ytop), query.
3.1 Longest matching prefix
In order to find the longest matching prefix, we have to find the point p with
minimal x-coordinate in the semi-infinite range x ≥ xleft and with y-coordinate
below the threshold value ytop. Therefore, we first carry out a search for the
boundary value xleft. It ends at a leaf storing a point with minimal x-coordinate
larger than or equal to xleft. If this point has a y-coordinate below the threshold
value ytop, we are done. Otherwise we retrace the search path for xleft bottom-up
and inspect the roots of subtrees falling completely into the semi-infinite x-range.
These roots appear as right children of nodes on the search path. Among them
we determine the first one from below (which is also the leftmost one and) which
has a min field value below the threshold ytop . This subtree must contain the
answer to the minXinRectangle(xleft, ∞, ytop) query stored at its leaf. In order to
find it, we recursively proceed to the left child of the current node, if its min field
shows that the subtree contains a legal point, i.e., if its min field is (still) below
the threshold, and we proceed to the right child only, if we cannot go to the left
child (because the min field of the left child is above the threshold ytop ) [46]. Note
that in an actual implementation it is more efficient to truncate the initial search
for xleft and begin retracing the path as soon as the min field of the currently
inspected node is above ytop [47].

Figure 3.1: The search path of the query minXinRectangle(35, 80, 34) in a MART.
Visited nodes are highlighted, the pink node is the result returned by the query.
In internal nodes, the bottom value represents the router field, the top one the
min field. From [47].

A min-augmented range tree storing a set of 16
points at the leaves is visualized in Figure 3.1. In internal nodes, the bottom value
represents the router field, the top one the min field. The search path of the query
minXinRectangle(35, 80, 34) is highlighted.
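A minimal Python sketch of the structure and the minXinRectangle search follows; for brevity it is recursive rather than path-retracing, builds a static tree, and uses a small made-up point set:

```python
# Sketch of a MART over a static point set with pairwise distinct x-coordinates.
class Node:
    def __init__(self, router, minf, left=None, right=None, point=None):
        self.router, self.minf = router, minf        # router field, min field
        self.left, self.right, self.point = left, right, point

def build(points):
    """points: sorted by x; builds a balanced leaf-search tree bottom-up."""
    if len(points) == 1:
        x, y = points[0]
        return Node(x, y, point=points[0])
    mid = len(points) // 2
    left, right = build(points[:mid]), build(points[mid:])
    # Router = maximum x-coordinate of the left subtree (the x-value property),
    # min field = minimum y-coordinate over both subtrees.
    return Node(points[mid - 1][0], min(left.minf, right.minf), left, right)

def min_x_in_rectangle(node, xleft, ytop):
    """Leftmost point with x >= xleft and y <= ytop, or None."""
    if node.minf > ytop:
        return None                      # no legal point in this subtree
    if node.point is not None:
        return node.point if node.point[0] >= xleft else None
    if xleft <= node.router:             # left subtree may contain the answer
        found = min_x_in_rectangle(node.left, xleft, ytop)
        if found is not None:
            return found
    return min_x_in_rectangle(node.right, xleft, ytop)

pts = sorted([(10, 50), (20, 30), (40, 90), (60, 25)])
t = build(pts)
print(min_x_in_rectangle(t, 15, 34))   # (20, 30): leftmost legal point
```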
We can find the desired point in time which is proportional to the height of the
underlying leaf-search tree. Hence it is desirable to maintain the underlying tree
balanced. All we have to show is that the augmented information stored in the
min fields of nodes can be efficiently maintained when we carry out an update
operation and rebalance the underlying search tree.
3.2 Update operations
To show that the augmented information can efficiently be maintained during update operations, it is appropriate to think of an update operation for the underlying
balanced leaf-search tree as consisting of two successive phases [46]. In the first
phase, we insert or delete a point as in a normal (unbalanced) binary leaf-search
tree, and in the second phase we retrace the search path and carry out rebalancing
operations, if necessary. In order to update the information stored in the min
fields of internal nodes, the first phase has to be extended as follows. We retrace
the search path and carry out a tournament starting from the leaf affected by the
update operation: We recursively consider the min fields of the current node and
its sibling and store the minimum of both in the min field of their common parent.
In this way we correctly update the information stored in the min fields after the
first phase. Instead of retracing the search path in order to update the min fields,
the insertion process could also modify the min fields top-down.
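The bottom-up tournament of the first phase can be sketched as follows; this is a toy illustration with explicit parent pointers, which an actual implementation may or may not maintain:

```python
# Sketch: replaying the min-field tournament along the path from an updated leaf.
class Node:
    def __init__(self, minf, left=None, right=None):
        self.minf, self.left, self.right, self.parent = minf, left, right, None
        for child in (left, right):
            if child is not None:
                child.parent = self

def repair_min_fields(leaf):
    """Retrace the search path from the updated leaf, taking the minimum of
    each node and its sibling and storing it in the common parent."""
    node = leaf.parent
    while node is not None:
        node.minf = min(node.left.minf, node.right.minf)
        node = node.parent

l1, l2, l3 = Node(7), Node(4), Node(9)
inner = Node(min(l1.minf, l2.minf), l1, l2)
root = Node(min(inner.minf, l3.minf), inner, l3)
l2.minf = 2                 # phase one replaced the point stored at leaf l2
repair_min_fields(l2)
print(root.minf)            # 2
```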
In order to show that this information can also be maintained during the second
phase, i.e., during rebalancing, let us consider a right rotation. Here we assume
that a, b, c, d, e are the routers in increasing x-order stored in the router fields of
the internal nodes. The values of the min fields before the rotation are u, v, w, x, y
and u, v′, w, x′, y after the rotation. Note that u, w, y do not have to be changed, because their subtrees are not affected by the rotation. We just have to update the
min fields of the nodes A and B. Note, however, that the min value stored at
node A is (still) the overall min value x of all subtrees 1, 2, and 3; hence, we define
x′ = x. Choosing v′ = min(w, y) will finally restore the min fields correctly, cf.
Figure 3.2.
Figure 3.2: MART right rotation. The min field values are shown on top of the
router values. From [9].
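The rotation rule above (x′ = x for the node that ends up on top, v′ = min(w, y) for the demoted node) can be sketched in Java; the node layout and names are illustrative, not the thesis implementation.

```java
// Sketch of a MART right rotation that preserves the min fields, as in
// Figure 3.2: the node that ends up on top keeps the overall minimum of
// the three subtrees, and the demoted node recomputes its min from its
// two new children. Node layout is illustrative.
public class MartRotation {

    static class Node {
        int router;          // router field (x-value)
        int min;             // min field (smallest y-value in subtree)
        Node left, right;
        Node(int router, int min) { this.router = router; this.min = min; }
        Node(int router, int min, Node l, Node r) { this(router, min); left = l; right = r; }
    }

    // Right rotation around 'a' (with left child 'b'); returns the new
    // subtree root. Only the min fields of a and b change.
    static Node rotateRight(Node a) {
        Node b = a.left;
        a.left = b.right;                            // subtree 2 moves under a
        b.right = a;
        a.min = Math.min(a.left.min, a.right.min);   // v' = min(w, y)
        b.min = Math.min(b.left.min, a.min);         // x' = x (overall min)
        return b;
    }

    public static void main(String[] args) {
        Node t1 = new Node(5, 3), t2 = new Node(15, 8), t3 = new Node(25, 1);
        Node b = new Node(10, 3, t1, t2);
        Node a = new Node(20, 1, b, t3);             // overall min is 1
        Node top = rotateRight(a);
        System.out.println(top.min + " " + top.right.min);
    }
}
```

Because only two min fields are touched, the rotation remains a strictly local operation, which is exactly the locality property used below.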
Rotations and the process of maintaining the augmented min-information are
strictly local, hence we can freely choose an underlying balancing scheme for min-augmented range trees. Furthermore, the locality property enables us to decouple
the update and rebalancing operations, as will be shown in chapter 5.
3.3
Comparison with priority search trees and
priority search pennants
Lauer [47] has benchmarked the MART with (1) the priority search tree as used in
the approach by Lu and Sahni [21], herein after referred to as PST, (2) the priority
search tree as suggested by McCreight [22], herein after referred to as PST McC,
both of which are balanced with red-black trees, and (3) the priority search pennant by Hinze [48], herein after referred to as PSP. Lauer shows how balanced
PSPs answer longest matching queries and proves that the complexity is bounded
by O(log n). The MART and PSP were implemented and benchmarked with two
balancing schemes each: internal path reduction and red-black trees. These balancing schemes are strict balancing schemes, i.e., the balance condition is restored
immediately after each update. The MART is the simplest structure in terms of
node complexity, i.e., the number of values stored per node that are inspected
during a search operation, followed by the PST, PSP, and finally PST McC. For
minXinRectangle queries, the length of the search path, i.e., the number of inspected nodes during the search, was measured. The results have shown that
the average search path length of the MART was longer compared to the other
structures. Yet, concerning runtime performance, the results have shown that the
search path length is less crucial than the number and type of comparisons inside
each node along the path. The simple node structure of a MART node compensates for the longer search paths. For minXinRectangle queries the MART needed
45% less time than the PST.
The choice of the balancing scheme for the MART and PSP turned out to be of
rather low importance in terms of search time.
In terms of updates, the MART and PSP require fewer node manipulations than
PSTs. When the same balancing scheme is applied, the MART requires about
30% fewer node manipulations on average during an insertion than PSTs. In case
of deletions, the reduction is 27%. However, in terms of runtime, the performance
gain of the MART is only about 10% in case of insertions and 15% - 20% in case
of deletions compared to PSTs.
Chapter 4
Relaxed Balancing
In order to accelerate lookup and update operations, routing tables must be implemented in a way that they can be queried and modified concurrently by several
processes. If implemented in a concurrent environment there must be a way to
prevent simultaneous reading and writing of the same parts of the data structure.
A common strategy is to lock the critical parts. In order to allow a high degree
of concurrency, only a small part of the tree should be locked at a time. Relaxed
balancing has become a commonly used concept in the design of concurrent search
tree algorithms [6]. Instead of requiring that the balance condition is restored
immediately after each update, the balance conditions are relaxed such that the
rebalancing operations can be delayed and interleaved with search and update operations. In the following we will present the relaxed red-black tree as proposed in
Hanke, Ottmann and Soisalon-Soininen [14] as we utilized this scheme to relax the
min-augmented range tree. This scheme can be applied to any class of balanced
trees. The main idea is to use the same rebalancing operations as for the standard
(strictly balanced) version of the tree. In the following we recapitulate red-black
trees in order to build the basis for section 4.2.
4.1
Red-black trees
A red-black tree is a binary search tree with the following red-black properties [49]:
• Every node is either red or black.
• Every leaf is black.
• If a node is red, then both its children are black.
• Every path from the root to a leaf contains the same number of black nodes.
• The root node is black.
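The properties above can be checked mechanically. The following is a minimal sketch for a leaf-search tree with an illustrative node type (not the thesis implementation); it returns the black height of a subtree, or -1 if any property is violated.

```java
// Sketch of a checker for the red-black properties: every leaf black,
// no red node with a red child, equal black height on all root-to-leaf
// paths, and a black root. Node type is illustrative.
public class RedBlackCheck {

    static final boolean RED = false, BLACK = true;

    static class Node {
        boolean color;
        Node left, right;
        Node(boolean color, Node l, Node r) { this.color = color; left = l; right = r; }
    }

    // Returns the black height of the subtree rooted at v, or -1 if a
    // red-black property is violated somewhere below v.
    static int blackHeight(Node v) {
        if (v == null) return 0;                       // beyond a leaf
        boolean isLeaf = v.left == null && v.right == null;
        if (isLeaf && v.color == RED) return -1;       // every leaf is black
        if (v.color == RED
                && ((v.left != null && v.left.color == RED)
                 || (v.right != null && v.right.color == RED)))
            return -1;                                 // red node with red child
        int hl = blackHeight(v.left), hr = blackHeight(v.right);
        if (hl < 0 || hr < 0 || hl != hr) return -1;   // unequal black heights
        return hl + (v.color == BLACK ? 1 : 0);
    }

    static boolean isRedBlack(Node root) {
        return root != null && root.color == BLACK && blackHeight(root) >= 0;
    }

    public static void main(String[] args) {
        Node leafA = new Node(BLACK, null, null), leafB = new Node(BLACK, null, null);
        Node leafC = new Node(BLACK, null, null);
        Node red = new Node(RED, leafA, leafB);
        Node root = new Node(BLACK, red, leafC);
        System.out.println(isRedBlack(root));
    }
}
```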
Figure 4.1: Call of the rebalancing procedure up-in (denoted by ↑). Filled nodes
denote black nodes. From [14].
These constraints enforce a critical property of red-black trees: the longest possible
path from the root to a leaf is no more than twice as long as the shortest possible
path. A red-black tree with n internal nodes has height at most 2 lg(n + 1). The
immediate result of an insertion or removal may violate the properties of a red-black tree. Restoring the red-black properties requires a small number (O(log n) or
amortized O(1)) of color changes and no more than three tree rotations (maximum
two for an insertion).
4.1.1
Insertions
In order to insert a new key we first locate its position among the leaves and
replace the leaf by an internal red node v with two black leaves. If the parent of
v is red, we must restore the balance condition and call the rebalancing procedure
up-in for v, cf. Figure 4.1.
4.1.2
Deletions
In order to delete a key we first locate its position among the leaves and then
remove the leaf together with its parent. If the removed parent was black, the
balance condition is violated. If the remaining sibling is red, we just change its
color to black. Otherwise, the removal leads to a call of the rebalancing procedure
up-out for the remaining leaf, cf. Figure 4.2.
Figure 4.2: Deletion of an item (denoted by x) and call of the rebalancing procedure
up-out (denoted by ↓). From [14].
Figure 4.3: Call of the rebalancing procedure up-out (denoted by ↓). Half filled
nodes denote nodes that are either black or red. From [14].
The task of the procedure up-out attached to some node v is to increase the black
height of the subtree rooted at v by one. It either performs a structural change
and settles the request or it moves up in the tree, cf. Figure 4.3.
4.2
Relaxed balanced red-black trees
In order to uncouple the rebalancing tasks from an update, we only deposit an
up-in or an up-out request instead of calling these procedures immediately after
an update. The relaxed balance conditions require that [14]
1. on each path from the root to the leaf, the sum of the number of black nodes
plus the number of up-out requests is the same
2. each red node has either a black parent or an up-in request
3. all leaves are black
Figure 4.4: Deletion of an item (denoted by x).
The rebalancing requests can be carried out concurrently as long as they do not
interfere at the same nodes. This can be achieved by settling the rebalancing
requests in a top-down manner [6]. In order to facilitate the locating of rebalancing
requests in the tree, we utilize a problem queue as proposed in [50]. For each type
of request, we maintain a separate queue. If all problem queues are empty, the
tree is (strictly) balanced. Each node maintains an additional bit for each problem
queue, which is set as soon as a link to that node is inserted in the queue. When
the request in the queue is deleted, the rebalancing process resets this bit to zero.
To avoid side effects, every node holds at most one request of the same type.
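A problem queue with this per-node "already enqueued" bit can be sketched as follows; the names are illustrative, not the implementation from [50].

```java
// Sketch of a problem queue with a per-node bit: a node is appended only
// if its bit for this queue is clear, and the bit is reset when the
// rebalancing process dequeues it. Names are illustrative.
import java.util.ArrayDeque;

public class ProblemQueue {

    static class Node {
        boolean upInQueued;          // bit for the up-in problem queue
    }

    private final ArrayDeque<Node> queue = new ArrayDeque<>();

    // Deposit an up-in request; since a node has at most one request of
    // the same type, a duplicate deposit is ignored.
    synchronized void deposit(Node v) {
        if (!v.upInQueued) {
            v.upInQueued = true;
            queue.addLast(v);
        }
    }

    // The rebalancing process takes the next request and resets the bit.
    synchronized Node next() {
        Node v = queue.pollFirst();
        if (v != null) v.upInQueued = false;
        return v;
    }

    synchronized boolean isEmpty() { return queue.isEmpty(); }

    public static void main(String[] args) {
        ProblemQueue q = new ProblemQueue();
        Node v = new Node();
        q.deposit(v);
        q.deposit(v);                        // ignored: bit already set
        System.out.println(q.next() == v && q.isEmpty());
    }
}
```

One such queue would be kept per request type; an empty set of queues means the tree is strictly balanced.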
Since the same rebalancing operations as for the standard version of the tree are
used, the number of rebalancing operations from the strict balancing scheme carry
over to relaxed balancing.
If a deletion falls into a leaf which has a red parent with an up-in request the leaf
is immediately deleted and the up-in request abandoned, cf. Figure 4.4. Thus, a
number of insertions and subsequent deletions of the same nodes will not cause
any rebalancing requests and hence no rebalancing operations are required.
Otherwise, we just deposit a removal request at the appropriate leaf. The deallocation, cf. Figure 4.2, is thus part of a rebalancing process.
4.2.1
Interleaving updates
If an insertion falls into a leaf which has a removal request the removal request is
abandoned and the key reinserted at that leaf.
If the leaf has an up-out request the up-out request is removed and the leaf is
replaced by an internal black node with two leaves.
If a deletion falls into a leaf v with a red parent, and the leaf’s sibling has an up-out, up-in or removal request, it does not interfere with the deletion and remains
attached, and, once the removal request attached at v is being handled, the leaf
together with its parent is removed, as in Figure 4.2 (a). If the removal request
falls into a leaf with an up-out request or whose parent has an up-out request,
these requests have to be settled first.
Figure 4.5: Concurrent handling of the procedure up-out (denoted by ↓). From
[14].
4.2.2
Concurrent handling of rebalancing transformations
Two up-out requests that occur at sibling nodes are in conflict. However, this can
be solved by applying the transformations in Figure 4.5 (a) or (b).
If two up-out requests occur in the same area as in Figure 4.5 (c), this conflict can
be settled by one rotation and recoloring.
In the following chapter we will present a strategy to update min-augmented range
trees in such a way that the rebalancing tasks can be left for separate processes
that perform several local modifications in the tree.
Chapter 5
Relaxed Min-Augmented Range
Trees
We will now examine how the min-augmented range tree and the relaxed balanced red-black tree can be combined into the relaxed min-augmented range tree
(RMART).
The relaxed balance conditions stay unmodified:
1. on each path from the root to the leaf, the sum of the number of black nodes
plus the number of up-out requests is the same
2. each red node has either a black parent or an up-in request
3. all leaves are black
All features of the min-augmented range tree remain valid, except for the x-value
property. After a deletion, every node may have an x-value that is larger than or equal
to the highest x-value of its left subtree and smaller than the smallest x-value in
the right subtree. Hence, we do not have to update the x-values after a deletion.
The insertion process modifies the min fields top-down. In order to update the min
fields after a deletion we introduce the update request up-y. Hence, in addition
to up-in, up-out and removal requests, red and black nodes can also have up-y requests. Furthermore, we need one additional problem queue for the up-y
requests.
If the deletion falls into a leaf whose red parent has an up-in request, the leaf
together with its parent are removed, cf. Figure 4.4. Additionally, we may have
to attach an up-y request to the leaf’s sibling, cf. Figure 5.1.
Otherwise, as in the case of relaxed balanced red-black trees, we only deposit a
removal request.
When we settle a removal request which resides in a leaf with a red parent, the
leaf together with its parent are removed, cf. Figure 4.2 (a). Additionally, an up-y
request is attached to the leaf’s sibling. When we settle a removal request which
Figure 5.1: Deletion of an item and attachment of the rebalancing procedure up-y
(denoted by 4).
Figure 5.2: Handling of an up-y request (denoted by 4). The values stored in the
nodes denote the min fields, router values are omitted.
resides in a leaf with a black parent and red sibling, we remove the leaf together
with its parent, and color the sibling black (as in Figure 4.2 (b)). Additionally, we
attach an up-y request to the leaf’s sibling in order to restore the min fields.
When we settle a removal request which resides in a leaf with a black parent and
sibling, we remove the leaf together with its parent, and attach an up-out request
to the leaf’s sibling (Figure 4.2 (c)). Additionally, we attach an up-y request to
the leaf’s sibling.
If a node v has an up-y request we must update the predecessors’ min fields if
necessary. To settle an up-y request we compare the min field values of v and
its sibling. Let m be the minimum of these values. If m equals the min field
value of parent(v), we delete the up-y request at v. Otherwise, we delete the up-y request at v, update the min field of parent(v) to the new value m, and shift
the up-y request to parent(v), if it doesn’t have an up-y request yet, see Figure 5.2.
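Settling a single up-y request can be sketched in Java; the node type and field names are illustrative, not the thesis implementation.

```java
// Sketch of settling one up-y request at node v: compare v's min field
// with its sibling's, and either delete the request (parent already
// correct) or fix the parent's min field and shift the request upward.
public class UpYRequest {

    static class Node {
        int min;                 // min field
        boolean upY;             // pending up-y request?
        Node parent, left, right;
        Node(int min) { this.min = min; }
    }

    static Node sibling(Node v) {
        Node p = v.parent;
        return v == p.left ? p.right : p.left;
    }

    // Settle the up-y request at v; returns the node now carrying the
    // request, or null if it was absorbed.
    static Node settleUpY(Node v) {
        v.upY = false;
        Node p = v.parent;
        if (p == null) return null;                 // reached the root
        int m = Math.min(v.min, sibling(v).min);
        if (m == p.min) return null;                // min field already correct
        p.min = m;                                  // fix the parent's min field
        if (!p.upY) { p.upY = true; return p; }     // shift the request upward
        return null;                                // parent already has one
    }

    public static void main(String[] args) {
        Node l = new Node(7), r = new Node(3), p = new Node(2);
        p.left = l; p.right = r; l.parent = p; r.parent = p;
        l.upY = true;            // e.g. after the leaf with y = 2 was removed
        Node carrier = settleUpY(l);
        System.out.println(p.min + " " + (carrier == p));
    }
}
```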
5.1
Longest matching prefix
If the search for the longest matching prefix ends at a leaf with a higher y-value
than the threshold or a removal request, the search has to track back. If the y-fields
along the path have not yet been (completely) updated, the search might again
end at a leaf with an invalid y-value or a removal request. Hence, the search has to
be modified in such a way that all leaves with a valid x-value are visited one after
the other (in ascending x-values) until a leaf with a valid y-value has been found.
In the following we will show that only one node on average (averaged over the
number of y-min paths, see below) has to be updated in order to settle an up-y
request. Clearly, this benefits IP lookup since extensive backtracking is avoided.
Definition 1. Let v be a leaf with min field y. The y-min path of v contains v
and all ancestor nodes of v with min field y.
A node k belongs to the y-min path of a leaf v, if v stores the minimum y-value of
all y-values of leaves stored in the subtrees of k. Consider for example the MART
in Figure 3.1. The y-min path of the leaf storing the point (55, 40) contains the
leaf itself, its parent and grandparent. The y-min path of the leaf storing the point
(40, 61) only contains the leaf itself.
The length of a y-min path is the number of internal nodes it contains. The
maximum length of a y-min path equals the height of the tree.
Theorem 1. The average y-min path length is N/(N+1), i.e., bounded by 1.
Proof. Let N be the number of internal nodes. The sum of the y-min path lengths
is N , as each internal node belongs to exactly one y-min path. There are N + 1
leaves and hence N + 1 y-min paths.
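Theorem 1 can be checked empirically: with distinct y-values, each internal node of a perfect tree belongs to exactly one y-min path, so the path lengths sum to N. The array-based level layout below is an illustrative test harness, not the thesis code.

```java
// Empirical check of Theorem 1 on a perfect binary leaf-search tree:
// the y-min path lengths, summed over all leaves, equal the number of
// internal nodes N. The layout (one int[] per level) is illustrative.
import java.util.ArrayList;
import java.util.List;

public class YMinPaths {

    // leafY holds the distinct y-values of the leaves (a power of two).
    // Returns the sum over all leaves of the number of ancestors whose
    // min field equals the leaf's y-value (its y-min path length).
    static int sumOfYMinPathLengths(int[] leafY) {
        List<int[]> levels = new ArrayList<>();
        levels.add(leafY);
        while (levels.get(levels.size() - 1).length > 1) {
            int[] cur = levels.get(levels.size() - 1);
            int[] next = new int[cur.length / 2];
            for (int i = 0; i < next.length; i++)
                next[i] = Math.min(cur[2 * i], cur[2 * i + 1]);   // min fields
            levels.add(next);
        }
        int sum = 0;
        for (int leaf = 0; leaf < leafY.length; leaf++)
            for (int d = 1; d < levels.size(); d++)               // ancestors only
                if (levels.get(d)[leaf >> d] == leafY[leaf]) sum++;
        return sum;
    }

    public static void main(String[] args) {
        int[] y = {55, 12, 40, 61, 7, 90, 33, 28,
                   71, 3, 84, 19, 66, 47, 95, 22};   // 16 distinct y-values
        System.out.println(sumOfYMinPathLengths(y)); // equals N = 15
    }
}
```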
5.2
Interleaving updates
If an insertion falls into a leaf with an up-y request, the procedure stays the same
as for the relaxed balanced red-black tree and the up-y request is shifted to the
new internal node.
If the leaf has a removal and an up-y request, then we insert the new point and
detach the removal request.
If an up-out request and an up-y request meet at a node v, then the up-out transformations that don’t involve rotations do not interfere with the up-y requests,
cf. Figure 4.3 (b) and (e). If rotations are involved, the up-y request at v or at
nodes that are involved in the rotation are shifted out of the sphere of influence
of the rotation prior to the up-out transformations. As in the case of rotations
in standard min-augmented range trees, the min fields of the involved nodes are
maintained during a rotation.
Chapter 6
Concurrency Control
Concurrent computing is related to parallel computing, but focuses more on the
interactions between processes. In order to cooperate, concurrently executing processes must communicate and synchronize. Interprocess communication is based
on the use of shared variables (variables that can be referenced by more than one
process) or on message passing [51]. If several processes operate concurrently on
shared data then unintended results might occur. This happens when two operations, running in different threads, but acting on the same data, interleave. This
means that the two operations consist of multiple steps, and the sequences of steps
overlap [52]. A sequence of statements that must appear to be executed as an indivisible operation is called a critical section. The term “mutual exclusion” refers
to mutually exclusive execution of critical sections [51].
A common communication strategy in tree structures is that processes use various
kinds of locks while traversing the tree. For example, Ellis proposed a locking
protocol for strictly balanced search trees [53]. Nurmi and Soisalon-Soininen presented a modification of Ellis’ scheme for relaxed balanced search trees [54]. Both
protocols use r-, w- and x-locks. Search operations place a shared lock (or r-lock)
on nodes, update operations place write locks (or w-locks) and exclusive locks (or
x-locks) on nodes. Several processes can hold an r-lock at a node at the same
time. Yet, only one process can hold a w- or x-lock at a node. Furthermore, a
node can be both w-locked by one process and r-locked by several other processes.
An x-locked node cannot be r-, w- or x-locked simultaneously by another process.
An update process uses w-locks, if it wants to exclude other update processes, but
does not want to exclude search processes. If an update process changes the search
structure, then it uses x-locks.
A w-lock can be changed into an x-lock if the node is not r-locked by another
process. An x-lock can always be changed into a w-lock.
In both the strictly and the relaxed balanced MART, a node records whether it is w- or x-locked by a process in the boolean variables nodeWriteLock
and nodeXLock. It further maintains the number of processes that hold an r-lock
in the integer variable nodeReadLock. A boolean variable is not sufficient since the
node can be r-locked by several processes.
The procedures to set the various locks are visualized in Algorithms 1, 2 and 3. As
can be seen, an x-lock can only be obtained via a w-lock, i.e., if a thread wants to
x-lock a node, it first must obtain a w-lock. Algorithm 4 shows how an x-lock is
changed into a w-lock. All these locking procedures must be synchronized in order
to prevent several processes from calling the methods simultaneously.
A monitor consists of a collection of permanent variables used to store the resource’s state, and some procedures, which implement operations on the resource.
The permanent variables may be accessed only from within the monitor. Execution
of the procedures in a given monitor is guaranteed to be mutually exclusive. This
ensures that the permanent variables are never accessed concurrently [51]. The
concept of a monitor was introduced by Brinch-Hansen [55]. A monitor is formed
by encapsulating both a resource definition and operations that manipulate it [51].
This textual grouping of critical sections together with the data which are manipulated is superior to critical sections scattered throughout the user program as
described by Dijkstra and Hoare in the early days of concurrent programming [56].
The MART as well as the RMART are implemented in the Java programming
language. Java provides a basic synchronization idiom: the synchronized keyword.
Making methods synchronized has the effect that it is not possible for several
invocations of synchronized methods on the same object to interleave. When one
thread is executing a synchronized method for an object, all other threads that
invoke synchronized methods for the same object block (suspend execution) until
the first thread is done with the object. All methods to set, release or change a
lock are synchronized.
Algorithm 1 Set r-lock
procedure setNodeReadLock
    if !nodeXLock then
        nodeReadLock ← nodeReadLock + 1
        return true
    else
        return false
    end if
end procedure
Algorithm 2 Set w-lock
procedure setNodeWriteLock
    if !nodeWriteLock && !nodeXLock then
        nodeWriteLock ← true
        return true
    else
        return false
    end if
end procedure
Algorithm 3 Set x-lock
procedure setNodeXLock
    if nodeReadLock == 0 && nodeWriteLock then
        nodeXLock ← true
        nodeWriteLock ← false
        return true
    else
        return false
    end if
end procedure
Algorithm 4 Change x-lock to w-lock
procedure changeNodeXLockToNodeWriteLock
    if nodeXLock then
        nodeWriteLock ← true
        nodeXLock ← false
        return true
    else
        return false
    end if
end procedure
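Since the thesis implementation is in Java, the per-node lock state of Algorithms 1–4 can be sketched with synchronized methods; the class below is a minimal sketch mirroring the variables named in the text, with all other node fields omitted.

```java
// Minimal sketch of the per-node lock state, with synchronized methods
// so that acquisition and release never interleave. Mirrors the
// variables nodeReadLock, nodeWriteLock and nodeXLock from the text;
// the surrounding MART node fields are omitted.
public class LockableNode {

    private int nodeReadLock;       // number of processes holding an r-lock
    private boolean nodeWriteLock;  // w-locked by one process?
    private boolean nodeXLock;      // x-locked by one process?

    synchronized boolean setNodeReadLock() {
        if (!nodeXLock) { nodeReadLock++; return true; }
        return false;               // an x-lock excludes readers
    }

    synchronized void releaseNodeReadLock() { nodeReadLock--; }

    synchronized boolean setNodeWriteLock() {
        if (!nodeWriteLock && !nodeXLock) { nodeWriteLock = true; return true; }
        return false;               // only one w- or x-lock at a time
    }

    // An x-lock can only be obtained via a w-lock, and only when no
    // process holds an r-lock on the node.
    synchronized boolean setNodeXLock() {
        if (nodeReadLock == 0 && nodeWriteLock) {
            nodeXLock = true;
            nodeWriteLock = false;
            return true;
        }
        return false;
    }

    synchronized boolean changeNodeXLockToNodeWriteLock() {
        if (nodeXLock) { nodeXLock = false; nodeWriteLock = true; return true; }
        return false;               // an x-lock can always be downgraded
    }

    public static void main(String[] args) {
        LockableNode n = new LockableNode();
        boolean r = n.setNodeReadLock();   // a reader enters
        boolean w = n.setNodeWriteLock();  // a w-lock may coexist with readers
        boolean x = n.setNodeXLock();      // fails: a reader is present
        System.out.println(r + " " + w + " " + x);
    }
}
```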
Figure 6.1: Potential for deadlock.
6.1
The deadlock problem
Requests by separate tasks for “resources” may possibly be granted in such a
sequence that a group of two or more tasks is unable to proceed–each task monopolizing resources and waiting for the release of resources currently held by others
in that group [57]. For example, consider two tasks P and Q, each requiring the
exclusive use of two different resources A and B. Clearly, if P obtains A at the
same time Q obtains B, a deadlock occurs since neither of them can proceed to
obtain the other resource it needs, see Figure 6.1.
There are four conditions required for deadlock [57]:
• Mutual exclusion. Only one process may use the shared resource at a time.
• “Wait for”. Processes may hold allocated resources while awaiting assignment of others.
• “No preemption”. Once a resource is held by a process, it cannot be forcibly
removed from the process.
• “Circular wait”. There exists a circular chain of processes, such that each
process holds one or more resources that are being requested by the next
task in the chain.
Deadlocks can be expressed more precisely in terms of graphs, for details see [57]
[58].
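The scenario of Figure 6.1 disappears if the circular-wait condition is broken. The following Java sketch (names and structure illustrative) has both tasks acquire the two resources A and B in the same global order, so both always terminate.

```java
// Sketch of breaking the "circular wait" condition from Figure 6.1:
// both tasks acquire resources A and B in the same global order, so no
// cycle of waiting processes can form. If one task took B before A,
// the four deadlock conditions could all be satisfied.
public class LockOrdering {

    static final Object A = new Object(), B = new Object();
    static int work = 0;

    static Thread task() {
        return new Thread(() -> {
            synchronized (A) {          // always A first ...
                synchronized (B) {      // ... then B
                    work++;             // critical section using both resources
                }
            }
        });
    }

    // Run two competing tasks to completion; returns the total work done.
    static int runBoth() {
        Thread p = task(), q = task();
        p.start(); q.start();
        try {
            p.join(); q.join();         // both terminate: no circular wait
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return work;
    }

    public static void main(String[] args) {
        System.out.println(runBoth());
    }
}
```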
In the following two sections we present the locking protocol by Ellis [53] for strictly
balanced trees and the protocol by Nurmi and Soisalon-Soininen [54] for relaxed
balanced trees and show how these can be adapted to the (R)MART.
Figure 6.2: Insertion of a key.
6.2
Strictly balanced trees
In a conventional bottom-up rebalancing scheme rebalancing transformations are
carried out when an updater returns from the inserted or deleted node to the root.
If a bottom-up method is used in a concurrent environment, the path from the root
to a leaf needs to be locked for the time a writer operates; otherwise the process
can lose the path to the root. During the time the root is locked by an updater,
no other update process can access the tree. Thus, at most one updater can be
active at a time.
Search processes only traverse the tree top-bottom and use r-lock coupling, i.e., a
search process r-locks the node to be visited next before it releases the lock on the
currently visited node.
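R-lock coupling on the descent can be sketched as follows. The lock bookkeeping here is a single-threaded illustration of the invariant that a search holds at most two locks at a time; the node type and counters are illustrative, not the thesis implementation.

```java
// Sketch of r-lock coupling during a top-down search: the next node is
// r-locked before the lock on the current node is released, so the
// search never holds more than two locks and is never left without a
// lock on its path. Node type and counters are illustrative.
public class LockCoupling {

    static class Node {
        final int key;
        int rLocks;                  // readers currently holding this node
        Node left, right;
        Node(int key) { this.key = key; }
    }

    static int locksHeld = 0, maxLocksHeld = 0;

    static void rLock(Node v) {
        v.rLocks++; locksHeld++;
        maxLocksHeld = Math.max(maxLocksHeld, locksHeld);
    }

    static void rUnlock(Node v) { v.rLocks--; locksHeld--; }

    // Descend to the leaf for 'key' using r-lock coupling; in a
    // leaf-search tree every internal node has two children.
    static Node search(Node root, int key) {
        rLock(root);
        Node cur = root;
        while (cur.left != null) {
            Node next = key <= cur.key ? cur.left : cur.right;
            rLock(next);             // lock the child first ...
            rUnlock(cur);            // ... then release the parent
            cur = next;
        }
        rUnlock(cur);
        return cur;
    }

    public static void main(String[] args) {
        Node root = new Node(10);
        root.left = new Node(5); root.right = new Node(20);
        root.left.left = new Node(5); root.left.right = new Node(10);
        Node leaf = search(root, 7);
        System.out.println(leaf.key + " " + maxLocksHeld);
    }
}
```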
An insertion process w-locks the entire path from the root to the leaf, where the
key is inserted, in order to be able to rebalance immediately after the insertion. If
the key is already in the tree, all locks along the path are released and the insertion
process terminates. Otherwise, the leaf’s w-lock is changed into an x-lock. The
leaf is changed into an inner node with two leaves as children, cf. Figure 6.2. The
key stored in it is copied into one new leaf; the key to be inserted is stored in the
other new leaf. Then, if necessary, the x-value in the new internal node is adjusted.
By using this technique, the parent of the leaf, where the search terminated, does
not have to be x-locked.
Afterwards, the tree is rebalanced. To check which rebalancing transformation
applies, all relevant nodes are w-locked top-down. If only the node colors change,
w-locks are sufficient. If rotations are involved, then the w-locks are changed into
x-locks in top-down direction. X-locks are necessary because the tree structure
changes and a search is not excluded by a w-lock. Hence x-locks guarantee that
searches are not misled. After the rotation, the locks of the nodes that are locally in balance are released top-down. When the tree is completely in balance,
all remaining w-locks are released bottom-up and the insertion process terminates.
(a) Backtracking search
(b) Backtrack further
(c) Use r-lock coupling
to descend the tree
Figure 6.3: Locking protocol for backtracking searches.
The locking strategy of a deletion process is similar. The difference is that not
only the leaf, but also the grandparent, the parent and the sibling have to be x-locked. This is done in top-down direction after they have been w-locked. The leaf
together with its parent are deleted and the grandparent now points to the sibling.
Afterwards, all remaining locks are released. The locking strategy for rebalancing
the tree remains the same as in the insertion process.
Clearly, this locking scheme does not evoke deadlock situations among update processes. The root remains w-locked during the entire update and hence no other
update process is allowed to enter the tree. An update process x-locks consecutive
nodes only in top-down direction. Hence, this strategy does not evoke deadlock
situations among an update process and search processes. Hence, this scheme is
deadlock-free.
In the following we adapt this locking scheme to the MART.
6.2.1
Concurrent MART
The protocol for a search operation uses r-lock coupling. In case the search has
to track back, it has to determine the first subtree (from below) that has a min
field value below the threshold. Therefore, while backtracking, not only the parent, but also the sibling are r-locked, cf. Figure 6.3(a). If the search has to track
back further, the lowermost r-locks are released, and the grandparent and uncle
are r-locked, cf. Figure 6.3(b). As soon as the correct subtree has been found, the
leftmost r-locks are released and r-lock coupling is used to descend the tree, cf.
Figure 6.3(c). When a leaf is reached, the key is returned.
An insertion process uses x-locks in addition to w-locks because it updates the
min fields top-down when it searches for the insertion position. After a w-lock
has been acquired, it is changed into an x-lock to update the min field. Then it
is retransformed to a w-lock and the appropriate child is w-locked. Algorithm 5
illustrates the search phase of the insertion process.
The acquisition of a w- and x-lock has to be carried out in a loop. When the
process is not granted the lock, it calls yield(). This causes the currently executing
thread to temporarily pause and allow other threads to execute. Then, it tries
again to acquire the lock. Algorithm 5 returns with an x-lock on the located leaf.
Deadlocks cannot occur in this phase.
After the correct insertion position has been located, the new key is inserted and
afterwards, the x-lock is transformed into a w-lock.
In the second phase, the insertion process calls the rebalancing procedure for the
new internal node v. If only node colors change, the uncle is w-locked (nodes
that are not on the insertion path are not locked yet) and the colors are adjusted.
Then, the uncle, the parent and v are unlocked. Finally, all remaining w-locks are
released bottom-up.
If rotations are involved, the locking strategy is as follows: All nodes beneath the
node v for which the rotation is called (the node which is rotated) are unlocked; all
nodes above v and including v are w-locked (they obtained their lock during the
search phase of the insertion process). Then, to execute the rotation, all relevant
nodes are w-locked and then x-locked top-down. Then, the pointers are adjusted
and the min fields are restored. After the rotation, all nodes beneath the node
v that is returned by the rotation are unlocked top-down. The nodes above and
including v are w-locked. Finally, after the tree has been rebalanced, all remaining
w-locks are released bottom-up.
This strategy can create deadlock situations with backtracking search processes
that enter an area containing the nodes that are to be rotated. When a rotation
is carried out, the insertion process x-locks the nodes top-down. Search processes
use r-lock coupling also when they track back. If a search process wants to r-lock
its parent which is already x-locked, it fails. And the insertion process fails to
x-lock the child due to the r-lock. Since a rebalancing operation is carried out
subsequent to the insertion, it insists on carrying out the rotation. If a search
process wants to r-lock its parent which is already x-locked, this is not necessarily
a deadlock: the insertion process temporarily x-locks a node in order to
update the min field while searching for the insertion position. Hence, the search
process temporarily pauses and then tries again to r-lock the parent. If it is still
not granted the lock, it releases its current lock and restarts the search.
The deletion process first locates the appropriate leaf and thereby w-locks the entire path from the root to the leaf. In this phase, no deadlock situations arise since
w- and r-locks don’t exclude one another. After it has located the leaf, it w-locks
the sibling and then x-locks the appropriate nodes top-down: the grandparent, the
Algorithm 5 Find Leaf Insert
procedure findLeafInsert(searchkey, y_value)
    if MARTNode.rootNode is null then
        return null
    end if
    while (!MARTNode.rootNode.setNodeWriteLock()) do
        yield()                                      ▷ temporarily pause
    end while
    while (!MARTNode.rootNode.setNodeXLock()) do
        yield()
    end while
    MARTNode l ← MARTNode.rootNode
    while (!l.isLeaf()) do
        if l.getY().compareTo(y_value) > 0 then
            l.setY(y_value)                          ▷ update y value
        end if
        if l.getX().compareTo(searchkey) ≥ 0 then    ▷ branch left
            l.changeNodeXLockToNodeWriteLock()
            while (!l.getLeft().setNodeWriteLock()) do
                yield()
            end while
            l ← l.getLeft()
            while (!l.setNodeXLock()) do
                yield()
            end while
        else                                         ▷ branch right
            l.changeNodeXLockToNodeWriteLock()
            while (!l.getRight().setNodeWriteLock()) do
                yield()
            end while
            l ← l.getRight()
            while (!l.setNodeXLock()) do
                yield()
            end while
        end if
    end while
    return l
end procedure
Figure 6.4: Deadlock between deletion process and backtracking search process.
Figure 6.5: Deletion of a key. Capital X denotes removed nodes.
parent and finally the sibling and the leaf itself. A deadlock situation can arise if
the deletion process encounters a backtracking search process while x-locking the
appropriate nodes. Suppose the grandparent has been successfully x-locked and
the deletion process now intends to x-lock the leaf’s parent. This fails since the
parent is r-locked. And the search fails to backtrack due to the x-lock, cf. Figure
6.4.
If a deadlock situation arises, the search process temporarily pauses and then tries
again to r-lock the parent. If it is still not granted the lock, it releases its current
lock and restarts the search.
After the appropriate nodes have been successfully x-locked, the leaf and its parent
are deleted and the sibling’s x-lock is retransformed into a w-lock, cf. Figure 6.5.
Now, the deletion process tracks back, as long as necessary, in order to update the
predecessors’ min fields. To update a min field the deletion process uses x-locks.
After the min field has been updated, the deletion process retransforms the x-lock
into a w-lock and then x-locks the parent. After the min fields have been updated,
all nodes from the root to the sibling are w-locked, cf. Figure 6.5.
The strategy to adjust the min fields does not cause deadlock situations with
search processes.
Then, after the item is located, deleted and the min fields have been updated,
it rebalances the tree. The locking strategy for rebalancing the tree remains the
same as in the insertion process. If, during the rebalancing, a deadlock situation
arises with a backtracking search process, the search process temporarily pauses
and then tries again to r-lock the parent. If it is still not granted the lock, it
releases its current lock and restarts the search.
In summary, deadlock situations cannot arise between update processes. Further,
when x-locking consecutive nodes, x-locks are taken top-down. Hence, deadlock
situations can only occur with backtracking searches. In case a search process is
not granted the lock on the parent after a second try, it releases its current lock and
restarts the search. Hence, the locking protocol is deadlock-free.
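The retry-then-restart rule for backtracking searches can be sketched as follows. This is a minimal illustration; the class and method names are ours, not those of the thesis implementation.

```java
// Deadlock-avoidance rule for backtracking searches: a search that cannot
// r-lock the parent pauses once and retries; on a second failure it signals
// that the current lock must be released and the search restarted.
// All names here are illustrative stand-ins.
import java.util.function.BooleanSupplier;

class BacktrackRetry {
    /** Returns "locked" if the r-lock was obtained within two tries,
     *  otherwise "restart" to signal that the search must begin anew. */
    static String tryBacktrack(BooleanSupplier tryRLockParent) {
        for (int attempt = 0; attempt < 2; attempt++) {
            if (tryRLockParent.getAsBoolean()) {
                return "locked";
            }
            Thread.yield(); // temporarily pause before the second try
        }
        // still not granted: release the current lock and restart the search
        return "restart";
    }
}
```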
6.3 Relaxed balanced trees
In relaxed balanced data structures, the update processes perform no rebalancing
but leave certain information for separate rebalancing processes, which later
restore the balance. A rebalancing process can be activated when there are only
a few other active processes. Several rebalancers can work concurrently. Since the
update processes do no rebalancing and the separate rebalance operation is divided into several small steps, the nodes can be unlocked quickly. Yet, the tree
may temporarily be out of balance, i.e., its height is not necessarily bounded by
O(log n).
Search processes use r-lock coupling, i.e., a search process r-locks the node to be
visited next before it releases the lock on the currently visited node.
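A minimal sketch of r-lock coupling during a descent, using Java's ReentrantReadWriteLock read lock as a stand-in for the r-lock. The Node class is a hypothetical simplification of the tree nodes, not the thesis implementation.

```java
// r-lock coupling: the next node is r-locked BEFORE the lock on the
// current node is released, so the search always holds at least one lock
// on its path while other processes restructure the tree.
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockCoupling {
    static class Node {
        final ReadWriteLock lock = new ReentrantReadWriteLock();
        Node left, right;   // both null at a leaf
        int key;
        Node(int key) { this.key = key; }
    }

    /** Descend from the root to a leaf using r-lock coupling. */
    static Node descend(Node root, int searchKey) {
        Node current = root;
        current.lock.readLock().lock();
        while (current.left != null) {            // internal node
            Node next = searchKey <= current.key ? current.left : current.right;
            next.lock.readLock().lock();          // lock the child first ...
            current.lock.readLock().unlock();     // ... then release the parent
            current = next;
        }
        current.lock.readLock().unlock();
        return current;
    }
}
```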
In relaxed balanced structures, it is sufficient to use w-lock coupling during the
search phase of an update operation since rebalancing operations are separated
from the update operations. Once arrived at a leaf, the w-lock is changed into an
x-lock. In case of an insertion, the leaf is changed into an internal node with two
leaves.
Generally, the deletion process just attaches a removal request and the actual deletion is part of the rebalancing process. Analogous to the insertion process, the leaf
is x-locked and the removal request attached before the x-lock is released. If the
deletion process deletes the leaf, cf. Figure 4.2 (a) and (b), the leaf together with
its parent are removed. Additionally, an up-y request is attached to the leaf’s
sibling. To achieve this, a process that will perform a delete operation uses w-lock
coupling during the search phase. When the leaf to be deleted has been found, its
parent and grandparent are still kept w-locked, and the process w-locks the sibling
of the leaf. Then it x-locks the grandparent and parent of the leaf, the leaf itself,
and its sibling. Now the leaf and its parent are deleted, the grandparent points
to the sibling and an up-y request is attached to the sibling. Then, the remaining
locks are released, cf. Figure 6.6.
Figure 6.6: Deletion of a key. Capital X denotes the nodes that are deleted.
Figure 6.7: Deletion of a key by exchanging contents of nodes. Capital X denotes
the leaf that is to be deleted.
If structural changes are implemented by exchanging the contents of nodes, the
process keeps the parent P of the leaf that is to be deleted w-locked, and then
w-locks the sibling of the leaf as well as its nephews, if any. Then it x-locks the
nodes top-down. If the sibling node is a leaf, the parent must be turned into a leaf.
Then, the content of the sibling is copied into the parent node and the parent
pointers of the nephews, if any, are switched to point to P . Furthermore, an up-y
request may have to be attached to P , cf. Figure 6.7. Finally, the leaf and its
sibling are deleted and the remaining locks are released.
A rebalancing process w-locks the nodes while checking whether a transformation
applies. If a transformation applies, it changes all w-locks to x-locks. Nurmi and
Soisalon-Soininen suggest traversing the tree nondeterministically in order to locate the rebalancing requests. This implies that the top-most node can be locked
first among all nodes that have to be considered. Hence all locks are taken top-down and the locking scheme is deadlock-free.
If a problem queue is used to locate the requests, the situation is different. In order
to apply a transformation, the parent or grandparent of a node with a rebalancing
request has to be considered. These are w-locked bottom-up. Hence, the scheme
must be extended in an appropriate way, cf. Hanke [6]: when a rebalancing process
tries to w-lock its parent that is w- or x-locked by another process, the rebalancer
immediately releases all locks it holds. The rebalancing processes are the only
processes that also w-lock bottom-up. Since they immediately release all locks if
a requested lock cannot be granted, deadlock situations are eliminated.
If the topmost node that is relevant for the transformation is successfully locked,
then all other relevant nodes are w-locked top-down. Along the way, we check if a
transformation can be applied. If no further node can be locked and no transformation can be applied, then all locks are released. If a transformation can be applied,
then the w-locks are changed to x-locks top-down. Then, the transformation is
applied, and if necessary, a newly generated rebalancing request is appended to
the respective queue, and all locks are released.
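Hanke's all-or-nothing rule for queue-driven rebalancers can be sketched as follows. Here `tryLock` stands in for the attempt to w-lock a node, and the node list is assumed to be given in bottom-up order; the class is an illustrative sketch, not the thesis implementation.

```java
// A rebalancer w-locks nodes bottom-up, but if any requested lock is not
// granted it immediately releases ALL locks it holds and gives up. Since
// rebalancers are the only processes locking bottom-up and never wait
// while holding locks, deadlock situations are eliminated.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.locks.Lock;

class RebalancerLocking {
    /** W-lock every node; on the first refusal release all held locks. */
    static boolean lockAllOrNone(List<Lock> nodesBottomUp) {
        Deque<Lock> held = new ArrayDeque<>();
        for (Lock lock : nodesBottomUp) {
            if (!lock.tryLock()) {            // lock not granted ...
                while (!held.isEmpty()) {     // ... release everything at once
                    held.pop().unlock();
                }
                return false;                 // give up; retry later
            }
            held.push(lock);
        }
        // All locks held: the real rebalancer would now upgrade to x-locks
        // top-down and apply the transformation. Released here only to keep
        // the sketch self-contained.
        while (!held.isEmpty()) held.pop().unlock();
        return true;
    }
}
```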
In the following we extend the locking scheme by Hanke such that it can be applied
to the RMART.
6.3.1 Concurrent RMART
Search processes use r-lock coupling, also when they track back. The protocol is
the same as in the MART. The difference is that the MART only has to track back
once, whereas the RMART may have to track back several times until a leaf with
a valid y-coordinate and without removal request has been found.
Since the insertion process modifies the y-fields top-down, it uses x-locks in addition to w-lock coupling. It retries until a requested lock is granted. It first
w-locks and then x-locks a node v. After the y-field has been updated, the x-lock
is changed into a w-lock. Then, it tries to w-lock the appropriate son. If this is
successful, it releases v’s lock and x-locks the son. Once arrived at a leaf, it inserts
the new item and then releases the x-lock. Deadlocks among search and insert
processes cannot occur.
The protocol for delete operations is similar to that described for relaxed balanced
trees. The difference is that if the deletion process fails to x-lock the appropriate
nodes top-down, i.e., in case it does not just attach a removal request but deletes
the leaf, then it transforms all x-locks into w-locks, temporarily pauses and then
tries again to x-lock the nodes. This avoids deadlock situations with backtracking
search processes.
The protocol for the rebalancing processes has to be extended such that it releases
all locks if it fails to x-lock top-down. Again, this avoids deadlock situations with
backtracking search processes and hence the entire locking scheme is deadlock-free.
As a plausibility check corroborating the correctness of the proposed locking
schemes, we will present an interactive animation of the relaxed balanced
min-augmented range tree. The animation will also foster a deeper understanding
of the various locking strategies.
Chapter 7
Interactive Visualization of the RMART
According to Price, Small and Baecker “program visualization is the use of graphics to enhance the understanding of a program” [59]. Stasko and Domingue define
the term program visualization as “the visualization of actual program code or
data structures in either static or dynamic form” [60]. An early example of a
static visual representation of a program is the flowchart, introduced by Goldstein
and von Neumann [61]. The first dynamic representation was probably Knowlton's animation of dynamically changing data structures in Bell Labs' low-level
list processing language [62] [63].
In this chapter we will present an interactive animation of the RMART. We visualize how the RMART evolves over time as well as the acquisition and release
of locks while the search, insert, delete and rebalancing processes operate on the
tree. The animation shows how the RMART is traversed concurrently by the various processes and the types of locks they use. In order to better understand the
interactions of the processes, the user can interactively, i.e., at runtime of the animation, control the number and type of processes currently operating on the tree.
Since the animation focuses on the presentation of the main concepts, it does not
show in detail how the red-black properties are restored. When a rotation occurs,
the visualization shows the resulting data structure after the rotation. Red-black
tree animations that visualize the operations step-by-step can be found in [64] [65].
Execution in concurrent programs may be non-deterministic. That is, multiple
executions of the same program may result in varying program behaviors. This
makes it difficult to evaluate the correctness of concurrent programs, as deadlock
situations may occur only once in hundreds or thousands of executions. The visualization can be seen as a further indication that the proposed locking scheme
is deadlock-free. Of course, it is not a proof that the locking scheme actually is
deadlock-free.
This animation was developed in collaboration with Waldemar Wittmann as part
of his bachelor thesis [20]. In the following we will describe the application
framework, the underlying architecture as well as the graphical user interface.
7.1 Application framework
The visualization has been developed utilizing Qt. Qt is a cross-platform application framework which includes an intuitive class library, integrated development
tools, support for C++ and Java development (Qt Jambi) as well as desktop and
embedded development support. As a prominent example, the K Desktop Environment (KDE), a contemporary desktop environment for UNIX systems, has
been developed with Qt. In the following we will describe the architecture of the
visualization.
7.2 Architecture
The visualization uses a model-view-controller (MVC) architecture. The MVC
paradigm is an architectural pattern that is often used when building user interfaces. The model represents the data and functionality of the application.
The view attaches to the model and renders its contents. In addition, when the
model changes, the view redraws the affected part to reflect those changes. The
model-view separation makes it possible to display the same data in several different views, and to implement new types of views, without changing the underlying
data structures.
The controller processes and responds to events, typically user actions such as
keystrokes, and may invoke changes on the model.
In Design Patterns, Gamma et al. [66] write:
MVC consists of three kinds of objects. The Model is the application
object, the View is its screen presentation, and the Controller defines
the way the user interface reacts to user input. Before MVC, user interface designs tended to lump these objects together. MVC decouples
them to increase flexibility and reuse.
This clearly expresses the advantage of the separation of the three components.
Qt Jambi provides the abstract class QTreeModel to create custom models that
represent tree structures. It further provides a ready-to-use implementation of a
tree view: QTreeView. QTreeModel defines an interface that is used by QTreeView
to access data.
The model and the view communicate via the powerful "signals and slots" mechanism: a signal is emitted when a particular event occurs. A slot is a function that
is called in response to a particular signal.
When the data changes, e.g., when a lock is acquired, when a node is inserted or
deleted, when the color changes or when a rotation occurs, the model emits a signal
to inform the view about the change. In response, the view redraws the affected
part. Further, the controller processes signals emitted from the user interface, e.g.,
when the user increases the number of insertion processes.
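The coupling between model and view can be illustrated with a minimal stand-in for the signals-and-slots mechanism. This is not the Qt Jambi API, merely a sketch of the pattern; the class and method names are ours.

```java
// Minimal signals-and-slots stand-in: the model "emits a signal" by
// notifying every connected "slot" when its data changes, and each slot
// (e.g. the view's redraw handler) reacts to the change.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class TreeChangeModel {
    private final List<Consumer<String>> slots = new ArrayList<>();

    /** Connect a slot (e.g. the view's redraw handler) to the change signal. */
    void connectDataChanged(Consumer<String> slot) { slots.add(slot); }

    /** Emit the change signal, e.g. after a lock is acquired or a node inserted. */
    void emitDataChanged(String change) {
        for (Consumer<String> slot : slots) slot.accept(change);
    }
}
```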
A major advantage of using the MVC framework to implement a visualization
of the RMART is that only a few modifications to the existing source code
had to be made. This in turn minimized potential errors in the implementation of
the visualization. The visualization itself in turn proved to be a useful graphical
debugging tool. In the following we will describe the graphical user interface.
7.3 The graphical user interface
Figure 7.1 shows a screenshot of the visualization. The main window displays
the RMART data structure in an explorer-like style. The top-level item (numeral
string) represents the root. Nodes with the same indentation and connected by
a vertical line are siblings: the top node represents the right child of the parent,
the bottom node represents the left child. When traversing the tree, each process
changes the locks at each visited node. A numeral string represents the locks
currently set at a node: (r w x). For example, (2 0 0) at a node denotes that
the node is r-locked by two processes, and is neither w- nor x-locked. A node can
be r-locked by several processes, but can be w- and x-locked by only one process
at a time. A node can be x-locked only if the node is not r-locked by another
process. An x-lock can only be obtained via a w-lock, i.e., before a process can
x-lock a node, it must have obtained a w-lock first. After the x-lock has been granted,
the process can release the w-lock.
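These compatibility rules can be summarized in a small model of a node's lock state. The model is a simplification for illustration: it does not track which process owns which lock, so "not r-locked by another process" is approximated by r = 0.

```java
// Lock state of a node as rendered in the view: "(r w x)".
// Rules: many r-locks; at most one w-lock and one x-lock; an x-lock
// only when no (other) r-lock is held, and only by upgrading a w-lock.
class NodeLockState {
    int r, w, x;                          // rendered as "(r w x)" in the view

    boolean tryRLock() { if (x > 0) return false; r++; return true; }
    void releaseRLock() { r--; }

    /** Only one process at a time may hold the w-lock. */
    boolean tryWLock() { if (w > 0 || x > 0) return false; w++; return true; }

    /** Upgrade an own w-lock to an x-lock; blocked while r-locks exist. */
    boolean tryUpgradeToXLock() {
        if (w == 0 || x > 0 || r > 0) return false;
        w--; x++;                         // the w-lock is released once x is granted
        return true;
    }

    @Override public String toString() { return "(" + r + " " + w + " " + x + ")"; }
}
```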
In order to better perceive the nodes that are being locked, the view highlights
the corresponding lock. Each lock has a different color. When a lock is set, the
appropriate color lights up. When the lock is released, the highlight disappears.
This makes it easy to track the various processes while they traverse the tree. For
example, the root in Figure 7.1 is r-locked by two processes, highlighted in yellow.
One of these processes already r-locked the left son. (Search processes use r-lock
coupling. The lock of the current node is released only when the lock of the child
has been acquired.) Further, the root and its left son are w-locked by an insert
process, highlighted in cyan. The delete process performs an actual deletion of a
leaf indicated by three x-locked nodes, highlighted in magenta: the leaf that is to
be deleted, its sibling and its parent.
Initially, the RMART is empty, i.e., consists of one leaf. In the Settings window, the
user can interactively vary the number of search, update and rebalancing processes
that operate on the tree. When the user decreases a number, then the process(es)
cannot be killed straight away, but first have to finish their task. Hence, the number
Figure 7.1: RMART visualization. Two search, one insert and one delete process.
next to the interactive field displays the actual number of active processes. As soon
as the appropriate number of processes have been stopped, the number is identical
to the one in the interactive field.
Using the slider in Zoom, the user can zoom in and zoom out of the tree. By
zooming out, it is possible to get a better overview of the activities in the tree
when the tree is quite large.
Via the Insert button, the user can insert a specified number of nodes on the
spot, i.e., without animation. For example, this feature can be used to generate
an initial tree of a specific user-defined size. The preset value is 100.
The Queue Size window displays the current size of the various queues, e.g., the
number of remaining queries in the search queue, or the number of up-in requests.
When the Show more node contents box is checked, the view renders additional node
information, namely the types of attached rebalancing requests, if the node has
any. The up-in request is highlighted in red, up-out in green, removal in black, and
up-y in blue. The view further displays the color of each node as textual string.
Figure 7.2 shows the tree after a number of insertions, and the attached up-in
requests. After all up-in requests have been settled, the red-black tree constraints
Figure 7.2: RMART visualization. A number of up-in requests due to insertions
of nodes.
are satisfied, i.e., if a node is red, then both its children are black and every path
Figure 7.3: All up-in requests shown in Figure 7.2 are settled.
from the root to a leaf contains the same number of black nodes, cf. Figure 7.3.
Note that this tree is more balanced.
When delete operations are additionally performed, then removal, up-out and up-y
requests may also be generated. Figure 7.4 captures the handling of a removal as
well as an up-out request. All relevant nodes are x-locked (highlighted in magenta).
There are three remaining removal as well as one up-out request in the queues.
The figure shows one more of each type attached to nodes. This is due to the fact
that of each type, one request has already been taken out of the corresponding
queue and is currently being handled.
Figure 7.5 shows one removal and one up-out request rebalancer operating on the
tree. The leaf that has an up-out request has only three black nodes (the leaf
included) on the path to the root; all other leaves have four black nodes. The
removal rebalancing process attaches an up-out request to the parent of the node
that is to be deleted. This is because the parent and the sibling are also black.
Further, an up-y request is attached and the leaf and its sibling are deleted, cf.
Figure 7.6. Here, it can be seen that the new leaf indeed needs an up-out request:
Figure 7.4: One search process as well as one removal and one up-out rebalancing
processes operate on the tree.
there are three black nodes on the path, instead of four. Furthermore, the up-out
rebalancer settled the up-out request highlighted in Figure 7.5. To this end, one
double rotation (right-left) and one recoloring were performed, cf. Figure 4.3(d).
Further, the up-y requests that were attached at those nodes which were involved
in the double rotation have been pulled to the root of the subtree, cf. Figure 7.6.
Using the slider in Speed, the user can control the speed and also freeze the execution using the pause button. When the pause button is pressed again, the
execution continues.
This interactive animation was presented to illustrate the functioning of the search,
update and rebalancing processes and to corroborate the correctness of the proposed locking schemes.
Figure 7.5: One removal and one up-out request rebalancer operate on the
RMART. The removal rebalancer attached an up-out request to the parent of
the node that is to be deleted. The up-out rebalancer x-locks the appropriate
nodes top down.
Clearly, it is not a proof that the locking strategies are
deadlock-free, yet it is an indication. A further indication for correctness was furnished by the Java PathFinder (JPF), an explicit state software model checker [67].
Both the strictly and relaxed balanced min-augmented range tree can be queried
and modified concurrently by several processes. The difference is that updates
must be performed serially in the strictly balanced MART, since the rebalancing
is performed immediately after an update. Yet, an update can be performed
concurrent with search operations. In the relaxed balanced MART, updates can
be performed concurrently. Further, updates can be performed concurrently with
search as well as with restructuring operations.
Figure 7.6: The removal and up-out requests in Figure 7.5 are settled. Further, the
removal and up-out rebalancers retrieved the next request from the appropriate
queues.
Hence, the advantage of the relaxed balanced min-augmented range tree over the
strictly balanced version is expected
to become noticeable in the presence of update bursts.
Although route updates typically occur on the order of tens or hundreds of times
per second, transient bursts may occur at rates that are orders of magnitude
higher [68] [69]. The update rate is critical because a router that cannot keep
up with all of the updates may trigger a condition known as route flap, in which
a router processing a backlog of route updates is incorrectly marked as being
unreachable by other routers. This state change creates a domino effect of route
updates that can cripple a network [68]. Such worst cases have already been
observed in practice [70].
Further, the relaxed balancing scheme is expected to reduce lookup latency if
lookups and updates “meet in the same area”.
The above deliberations suggest that a relaxed scheme is better suited, yet it is not
clear whether these expected benefits compensate for the higher coordination effort
caused by the uncoupling of updates and rebalancing tasks. In the following we
investigate by experiment whether the relaxed balanced min-augmented range tree has an
advantage over the standard min-augmented range tree in a dynamic concurrent
environment.
Chapter 8
Experimental Results
Routers in different ASs use the Border Gateway Protocol (BGP) to exchange
network reachability information. Each BGP speaker maintains a set of tables
(Routing Information Bases, or RIBs): one for each BGP neighbour (Adjacency-RIB-Ins) and one for its own internal use for forwarding. It selects the "best" of
these routes to use for its local forwarding decisions (Local-RIB), and sends a copy
of this best route to all its peers (Adjacency-RIB-Outs) [71].
Each incoming update message causes a change in the corresponding Adjacency-RIB-In. If the information is a prefix withdrawal, then a comparison needs to be
made with the Local-RIB. If there is a match, then all other Adjacency-RIB-Ins
need to be scanned and a new best route installed into the Local-RIB, as well as
loading new announcement messages in the Adjacency-RIB-Outs to reflect this
local change of best path. If there are no other candidate routes in the other RIB-Ins then the route is withdrawn from the Local-RIB and a withdrawal message is
passed to the BGP speaker's peers [71].
If the incoming update message is an announcement, then the BGP engine has
to update the Adjacency-RIB-In and then compare this route to the current best
path in the Local-RIB. If this new route represents a better path, then the Local-RIB is updated and announcement messages are queued in all the Adjacency-RIB-Outs [71].
To conduct our experiments, we use real and up-to-date routing data that is supplied by the Advanced Network Technology Center at the University of Oregon
within the framework of the Route Views Project [72]. The Route Views routers
archive their BGP routing table snapshots and the BGP updates received from
their peers. RIB table dumps are collected every two hours.
8.1 The MRT format
Researchers and engineers often wish to analyze network behavior by studying
routing protocol transactions and routing information base snapshots. To this
end, the MRT format was developed to encapsulate, export, and archive this information in a standardized data representation [73]. The format was developed in
concert with the Multi-threaded Routing Toolkit (MRT) at Merit in the mid-90’s.
It is employed by RIPE RIS [74] and Routeviews BGP routing data collectors.
In our benchmark, we use routing data collected from the route-views2.oregon-ix.net router.1 The following example MRT record is taken from the April 13,
2007 routing information base snapshot.
TIME: 04/13/07 08:40:29
TYPE: TABLE_DUMP/INET
VIEW: 0
SEQUENCE: 2
PREFIX: 3.0.0.0/8
FROM: 134.55.200.31 AS293
ORIGINATED: 04/12/07 03:29:26
ORIGIN: IGP
ASPATH: 293 701 703 80
NEXT_HOP: 134.55.200.31
COMMUNITY: 293:14 293:46
STATUS: 0x1
The PREFIX entry contains the IP prefix of a particular routing table dump
entry. ORIGINATED contains the time at which this prefix was heard. The
NEXT_HOP entry determines the next hop address. For a full description of the
MRT format see [73].
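A hedged sketch of reading such a textual dump record: each line is a "KEY: value" pair that can be collected into a map. This parses only the ASCII rendering shown above, not the binary MRT format; the class name is ours.

```java
// Parse one textual MRT record (as printed by the dump tools above) into
// a key/value map, so fields like PREFIX or NEXT_HOP can be read off.
import java.util.LinkedHashMap;
import java.util.Map;

class MrtTextRecord {
    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : record.split("\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) continue;            // e.g. ANNOUNCE / WITHDRAW headers
            fields.put(line.substring(0, colon).trim(),
                       line.substring(colon + 1).trim());
        }
        return fields;
    }
}
```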
Multiple networks may have the same path and path attributes. In that case,
specifying multiple network prefixes in the same update message is more efficient
than generating a new one for each network. Hence, an update message can advertise, at most, one set of path attributes, but multiple destinations, provided that
the destinations share these attributes. Furthermore, an update message can list
multiple routes that are to be withdrawn from service.
In the following message, the fields are the attributes from a single BGP update
message which announces a destination.
TIME: 04/13/07 08:42:20
TYPE: BGP4MP/MESSAGE/Update
FROM: 66.185.128.1 AS1668
TO: 128.223.51.102 AS6447
ORIGIN: IGP
ASPATH: 1668 1299 9121
NEXT_HOP: 66.185.128.1
MULTI_EXIT_DISC: 1026
ANNOUNCE
  81.213.47.0/24
1 The peers of the route-views2.oregon-ix.net router can be found at
http://www.routeviews.org/peers/route-views2.oregon-ix.net.txt
The following message lists multiple routes that are to be withdrawn from
service.
TIME: 04/13/07 08:42:21
TYPE: BGP4MP/MESSAGE/Update
FROM: 206.24.210.99 AS3561
TO: 128.223.51.102 AS6447
WITHDRAW
  202.136.176.0/24
  58.65.1.0/24
  202.136.182.0/24
Update files contain the update messages and are rotated every 15 minutes, i.e.,
each update file contains prefix withdrawals and announcements that occur within
a 15-minute time interval.
We expect that the relaxed version shows better behavior in the presence of update
bursts. In our experiments, various scenarios with varying update frequencies
are simulated. A high update frequency corresponds to a short interarrival time
between routing updates.
The update file of April 13, 2007 at 10:27 contains a 9-second interval with (nearly)
peak update rates in each second (measured over one month). The interval starts at
10:36:24 and ends at 10:36:32, with 85961 prefix withdrawals and announcements
in total, cf. Figure 8.1.
8.2 Flow characteristics in internetwork traffic
Flows are considered to be sequences of packets with an n-tuple of common values
such as source and destination address prefixes, protocol and port numbers, ending
after a fixed timeout interval.
Figure 8.1: Excerpt from the update file of April 13, 2007 at 10:27. Number of
updates per second from 10:36:24 to 10:36:32: 9096, 12325, 15792, 11190, 7409,
5879, 7959, 7343, 8968.
8.2.1 Locality in internetwork traffic
Several studies have identified the presence of locality in internetwork traffic [75]
[76] [77] [78]. Temporal locality in network address traces refers to the phenomenon
that if an address is referenced, it is likely to be referenced again in the near future.
The reason for the existence of this temporal locality lies in the fact that packets
with the same destination tend to be transmitted closely in time, usually as the
result of transmission of data that are segmented into a sequence of packets [75].
A trace of references to IP addresses has high temporal locality if a large portion
of the repeated references have a short interarrival time, i.e., there is a large
probability of re-referencing the same IP address within a short period of time.
MacGregor and Chvets [77] analyzed traces that were captured at the University
of Auckland, the University of Alberta and at the San Diego Supercomputer Center commodity connection. The analysis showed that approximately 80% of all
references have interarrival times less than 0.2 seconds.
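The interarrival-time statistic behind such locality figures can be sketched as follows. The addresses, timestamps and threshold are illustrative; the class is ours, not from the cited studies.

```java
// For each repeated reference to an address, record the time gap since
// its previous occurrence, then compute the fraction of gaps below a
// threshold (e.g. 0.2 seconds) as a measure of temporal locality.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LocalityStats {
    /** Fraction of repeated references whose interarrival time < threshold. */
    static double shortGapFraction(List<String> addrs, List<Double> times,
                                   double threshold) {
        Map<String, Double> lastSeen = new HashMap<>();
        int repeats = 0, shortGaps = 0;
        for (int i = 0; i < addrs.size(); i++) {
            Double prev = lastSeen.put(addrs.get(i), times.get(i));
            if (prev != null) {                   // a repeated reference
                repeats++;
                if (times.get(i) - prev < threshold) shortGaps++;
            }
        }
        return repeats == 0 ? 0.0 : (double) shortGaps / repeats;
    }
}
```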
8.2.2 Statistical properties of flows
Chabchoub et al. [79] study the size of flows in a limited time window of duration
∆. The reason for considering short time windows is that in short time intervals,
volumes of flows exhibit only one major statistical mode.
The key observation when characterizing a traffic trace is the fact that if the
duration ∆ of the successive time intervals used for computing traffic parameters
is appropriately chosen, then the distribution of the size of the main contributing
flows in the time interval can be represented by a Pareto distribution and therefore
exhibits a unimodal behavior. More precisely, there exist ∆, Bmin, Bmax and a > 0
such that if S is the number of packets transmitted by a flow during ∆, then
P(S ≥ x | S ≥ Bmin) = (Bmin/x)^a, for Bmin ≤ x ≤ Bmax.
The parameter Bmin is usually referred to as the location parameter and a as the
shape parameter. In other words, if the time interval is sufficiently small then the
distribution of the number of packets transmitted by a long flow has one dominant
Pareto mode and therefore can be characterized in a robust way.
The quantity Bmin defines the elephants of the corresponding trace. An elephant
is a flow such that its number of packets during a time interval of length ∆ is
greater than or equal to Bmin . By definition of Bmax , flows whose size is greater
than Bmax represent a small fraction of the elephants. A mouse is a flow with a
number of packets less than Bmin .
It should be noted that the parameters computed in a time window of length ∆ do
not give a complete description of the distribution of the total number of packets
of a flow, since statistics are done over a limited horizon. To obtain information
on the total number of packets, it is necessary to glue the statistics from successive
time windows of length ∆. Chabchoub et al. [79] leave this as an open problem.
8.3 Generation of sequences of operations
Random samples from a Pareto distribution can be generated using inverse transform sampling. Given a random variate U drawn from the uniform distribution
on the unit interval (0, 1), the variate T = Bmin/U^(1/a) is Pareto-distributed [80].
It turns out that for commercial traffic, the value of Bmin is close to 20, Bmax close
to 94, and a close to 1.85 [79].2
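The inverse transform can be sketched directly. The method names are ours, and the parameters are the France Telecom values quoted above (Bmin = 20, a = 1.85).

```java
// Inverse-transform sampling of the Pareto distribution: for U uniform
// on (0,1), T = Bmin / U^(1/a) satisfies P(T >= x) = (Bmin/x)^a for
// x >= Bmin, i.e. T is never smaller than the location parameter Bmin.
import java.util.Random;

class ParetoSampler {
    static double sample(double u, double bMin, double a) {
        return bMin / Math.pow(u, 1.0 / a);
    }

    /** Draw one flow size with the France Telecom parameters from the text. */
    static double sampleFlowSize(Random rng) {
        return sample(rng.nextDouble(), 20.0, 1.85);
    }
}
```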
Due to unavailability of public IP traces associated with their corresponding routing tables, we synthetically generate a trace file using the 10:36:23 routing information base snapshot and the parameters above. The latest RIB snapshot provided
prior to 10:36:24 is the 08:42 snapshot. We generate the 10:36:23 snapshot starting with the available 08:40 RIB and perform the updates that took place between
08:42 and 10:36:23. (BGP routers typically receive multiple paths to the same
destination. We arbitrarily select one to be installed as best route.) The resulting
RIB contains 235620 entries.
To generate the traces, we randomly select a point within the range of each route
table entry; this ensures that a query is covered by at least one filter in the set,
2 Parameters for traces from the France Telecom (FT) network
Figure 8.2: Excerpt from the generated trace file.
hence default filter matches are avoided. Each query is multiplied T times. Yet,
packets of the same flow are not back-to-back but mixed with packets of other flows.
In order to interleave the various flows, we randomize the entries within a fixed
sized window which moves over the trace file. We choose the window length such
that packet interarrival times tend to be small. Figure 8.2 shows an excerpt of
the generated trace file with the temporal locality characteristics. Let the arrival
time of the packets be consecutive numbers. Then, the interarrival time of two
consecutive identical references is one time unit. There are 14 unique IP addresses,
of which three addresses, e.g., 5218394330100989951, have an interarrival time of one.
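The windowed randomization can be sketched as follows. For simplicity we shuffle within non-overlapping (tumbling) windows; the window length and seed are illustrative choices, not the values used in the experiments.

```java
// Interleave flows by shuffling trace entries within a fixed-size window
// that moves over the trace: repeated references to the same flow stay
// close together (short interarrival times) while flows still mix.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class TraceInterleaver {
    static <T> List<T> interleave(List<T> trace, int window, long seed) {
        List<T> out = new ArrayList<>(trace);
        Random rng = new Random(seed);
        for (int start = 0; start + window <= out.size(); start += window) {
            // shuffle only the entries inside the current window
            Collections.shuffle(out.subList(start, start + window), rng);
        }
        return out;
    }
}
```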
In router tables, IP lookup is intermixed with updates. To generate sequences
consisting of searches, insertions and deletions, we randomly dispersed the searches
in the sequence of updates (from the update file of April 13, 2007, starting at
10:36:24) such that the original order of the updates and of the searches was
maintained.
8.4 Test setup
The experiments are performed on a Sun Fire T2000 server with a 1.0 GHz UltraSPARC
T1 processor with six cores. It contains 8 GB DDR2 main memory and 3 MB
Level-2 cache. It has a Uniform Memory Architecture (UMA), i.e., memory is
shared among all six cores. It contains multiple physical instruction execution
pipelines (one for each core), with several active thread contexts per pipeline.
Each core is designed to switch between up to four threads on each clock
cycle. Threads that are stalled, e.g., threads waiting for a memory access,
are skipped. As a result, the processor's execution pipeline remains active doing
useful work, even as memory operations for stalled threads continue in parallel [81].
We concurrently perform a sequence of dictionary operations on a balanced standard red-black tree which is built by inserting the 10:36:23 RIB snapshot into an
empty tree. This snapshot represents the Local-RIB.
The test environment consists of three classes:
• The main program, which starts the test thread.
• The test thread, which builds the RIB snapshot and launches a variable number of tree threads.
• The tree threads, which perform the insert, delete and search operations on the snapshot.³
In the RMART as well as in the MART, the test thread remains active during the
execution of the test (the main program terminates after it has started the test
thread).
In the case of strict balancing, 2, 3, 4, 5 or 6 tree threads are concurrently active;
these are the tree threads which perform the dictionary operations, plus the test
thread. In the case of relaxed balancing, the four rebalancing processes, which get
their work from the appropriate problem queues, are started in addition.
The generated sequence consisting of searches and updates is stored in a linked
queue which manages concurrent access. Each tree thread takes an entry from
the sequence, performs the operation and then retrieves the next entry from the
queue. The queue guarantees that each element is taken out exactly once.
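This scheme can be sketched as follows (class and operation names are illustrative assumptions, not our actual implementation): a java.util.concurrent.ConcurrentLinkedQueue provides exactly the required guarantee, since poll() removes the head atomically, so each entry is handed to exactly one thread even under concurrent access.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class OperationQueueDemo {
    // Each tree thread repeatedly takes one operation from the shared
    // queue. poll() is atomic, so no operation is executed twice.
    static int drain(Queue<String> ops) {
        int performed = 0;
        while (ops.poll() != null) {
            // here the tree thread would dispatch to insert/delete/search
            performed++;
        }
        return performed;
    }

    public static void main(String[] args) throws InterruptedException {
        Queue<String> ops = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 1000; i++) ops.add("search " + i);

        final int[] counts = new int[4];
        Thread[] workers = new Thread[4];
        for (int t = 0; t < 4; t++) {
            final int id = t;
            workers[t] = new Thread(() -> counts[id] = drain(ops));
            workers[t].start();
        }
        for (Thread w : workers) w.join();

        int total = 0;
        for (int c : counts) total += c;
        System.out.println(total); // 1000: every entry taken exactly once
    }
}
```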
To benchmark the MART and the RMART, we measure the total time needed to perform
the sequence of search and update operations as well as the average time of an
insert, delete and search operation. The time is measured with System.nanoTime,
which is provided in the java.lang package. To measure the total time of execution,
we take the minimum start time over all tree processes and the maximum end time
and compute the difference; cf. Figure 8.3 for an example.
³ We solely perform the operations on the Local-RIB.
Figure 8.3: The total execution time to perform a sequence of operations. In this
example, the operations are performed by four threads.
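The measurement described above can be sketched as follows (the array-based bookkeeping and names are illustrative, not the benchmark code itself):

```java
public class TotalTimeDemo {
    // Total execution time = latest end time minus earliest start time
    // across all tree threads (cf. Figure 8.3).
    static long totalTime(long[] startNanos, long[] endNanos) {
        long minStart = Long.MAX_VALUE, maxEnd = Long.MIN_VALUE;
        for (long s : startNanos) minStart = Math.min(minStart, s);
        for (long e : endNanos)   maxEnd   = Math.max(maxEnd, e);
        return maxEnd - minStart;
    }

    public static void main(String[] args) {
        // Four threads: thread 0 started first, thread 2 finished last.
        // In the benchmark these values come from System.nanoTime().
        long[] start = {100L, 150L, 120L, 170L};
        long[] end   = {900L, 880L, 950L, 940L};
        System.out.println(totalTime(start, end)); // 950 - 100 = 850
    }
}
```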
The benchmark considers the average of the total time as well as the average
time per insert, delete and search operation over three different sequences.
In Java, it is not possible to bind a thread to a processor. We assumed that the
Java Virtual Machine distributes the threads evenly across the six available cores.
A router should perform the rebalancing tasks in less busy times. To simulate
this, the rebalancing processes sleep a random amount of time between 0 and 5999
milliseconds.
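The behavior of a rebalancing process in the simulation can be sketched as follows (queue contents and names are illustrative; the sleep can be disabled for testing):

```java
import java.util.Queue;
import java.util.Random;
import java.util.concurrent.ConcurrentLinkedQueue;

public class RebalancerDemo {
    // Sketch of one rebalancing process: it takes rebalancing requests
    // from its problem queue and idles a random 0..5999 ms between tasks
    // to simulate rebalancing in less busy times.
    static int handleAll(Queue<String> problemQueue, Random rnd, boolean simulateSleep) {
        int handled = 0;
        while (problemQueue.poll() != null) {
            // ... perform the rotation/recoloring described by the entry ...
            handled++;
            if (simulateSleep) {
                try {
                    Thread.sleep(rnd.nextInt(6000)); // 0..5999 ms
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return handled;
                }
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        Queue<String> q = new ConcurrentLinkedQueue<>();
        q.add("rotate@node17");   // hypothetical problem-queue entries
        q.add("recolor@node42");
        System.out.println(handleAll(q, new Random(), false)); // 2
    }
}
```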
Before we benchmark the RMART and the MART in a concurrent environment, we
examine how the total time behaves when the sequence of operations is performed
sequentially. To this end, we perform a sequence of one million dictionary operations
on the MART. Since the underlying architecture switches between threads,
and the next operation may only be performed when the current operation is
complete, the switching time elapses without any actual work being done, and the
total time ascends with the number of parallel threads, cf. Figure 8.4.
8.5 Comparison of the RMART and the MART
We simulate various scenarios with varying update frequencies based on the 10:36:23
snapshot. The snapshot is represented by a balanced standard red-black tree which
is built by inserting the 10:36:23 snapshot into an empty tree. We repeat each test
three times and compute the average total time as well as the average time per
operation.
Both trees yield correct answers to lookup queries in a given sequence of operations, since in both trees the appropriate nodes are locked when performing a
structural change.
The focus of this benchmark lies on measuring the advantage of relaxed balancing
over instantaneous rebalancing. We do this by comparing the respective performance
results of both trees for the same number of running tree processes. The
performance gain constitutes 100 − (100 × totalTime_RMART / totalTime_MART) %.
For example, suppose
the total time to perform a sequence of operations is 9 seconds in the RMART
Figure 8.4: Total execution time to perform a sequence of 1000k insert, delete and
search operations sequentially.
and 10 seconds in the MART in the case of four processes. Then the performance
gain is 10%.
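The gain computation can be stated compactly (a minimal helper method, not part of the benchmark code):

```java
public class PerformanceGain {
    // gain [%] = 100 - (100 * totalTime_RMART / totalTime_MART)
    static double gain(double totalTimeRmart, double totalTimeMart) {
        return 100.0 - (100.0 * totalTimeRmart / totalTimeMart);
    }

    public static void main(String[] args) {
        // Example from the text: 9 s (RMART) vs. 10 s (MART).
        System.out.println(gain(9.0, 10.0)); // 10.0
    }
}
```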
Both trees are expected to be on par in the absence of routing updates. The
following simulation will confirm this hypothesis.
8.5.1 Solely lookups
To conduct this test, we use the generated trace file and execute one million lookups
on the 10:36:23 snapshot. In this setting, both trees have an equal number of active
processes since the rebalancing processes are not started.
The average time per search operation as well as the total time are on par in both
trees. The total time to perform the lookups decreases in both trees, see Figure
8.5(a). This is because nodes can be r-locked by an arbitrary number of processes.
In both trees, there is a 44% advantage in the case of five processes compared to
the case of two executing processes.
The average time to perform a search operation increases in both trees, see Figure 8.5(b). The higher the number of parallel processes, the higher the chance
that processes must wait to enter the synchronized method in order to r-lock the
same node. This phenomenon is known as “lock contention”. Hence, the average
time for a single search operation increases.
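A minimal sketch illustrates the effect (this class is illustrative, not our locking scheme's actual implementation): any number of processes may hold the r-lock simultaneously, yet entry into the synchronized method itself is executed by only one thread at a time, which is the source of the contention.

```java
public class RLock {
    private int readers = 0;
    private boolean wLocked = false;

    // Any number of processes may hold the r-lock at once, but the
    // synchronized entry method is serialized: under high parallelism,
    // threads queue up here even though the r-lock itself is shared.
    public synchronized boolean tryRLock() {
        if (wLocked) return false;
        readers++;
        return true;
    }

    public synchronized void rUnlock() {
        readers--;
    }

    public synchronized int readerCount() {
        return readers;
    }
}
```

Two processes can thus both hold the r-lock on the same node (readerCount() == 2), while a writer would be refused until all readers have released it.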
The results show that the total time decreases until eight threads (Figure 8.5(a)
Figure 8.5: Execution time. Solely search operations. (a) Total time to perform
1000k search operations. (b) Average time per search operation.
only shows up to five threads), even though 24 active execution threads are supported on this machine (six cores and each core is designed to switch between up
to four threads on each clock cycle). When performing the same scenario without
using r-locking (in this scenario it is not necessary that processes use r-locks since
no updates are performed), the total time descends until 17 processes. These results suggest that lock contention poses a significant scalability impediment.
Applications in Java are compiled to target the Java Virtual Machine. However,
such applications compiled to this virtual machine’s instruction set (called Java
byte codes) usually run on a processor either through an interpreter or through
just-in-time (JIT) compilation [82]. One problem with conventional techniques for
executing synchronized Java methods is that several time-consuming operations
have to be performed in order to execute the synchronized “statement”. Moreover,
these operations are performed once when the monitor is acquired and then
have to be repeated in order to release the monitor [83]. Hence, thread
synchronization significantly adds to the execution time of many programs [84].
Performance can be enhanced by tailoring a microprocessor to the Java computing environment, e.g., by providing hardware support for garbage collection and
thread synchronization [82]. But still, synchronization depends on the support of
the operating system [82]. Synchronization overhead can be further reduced when
the Java byte codes are synthesized directly to hardware. Section 8.7 outlines a
technique for converting byte codes to a hardware description language.
In the following scenario, we solely perform update operations.
Figure 8.6: Time to perform 85961 update operations.

(a) Total time to perform 85961 update operations.

(b) Average time per insert operation [µs]:
           2 processes   3 processes   4 processes
  RMART    51.668        65.54         91.4
  MART     69.482        145.077       233.98

(c) Average time per delete operation [µs]:
           2 processes   3 processes   4 processes
  RMART    118.106       114.175       150.083
  MART     198.904       313.17        363.719
8.5.2 Solely updates
The updates are taken from the update file of April 13, 2007 at 10:27; they start at
10:36:24 and end at 10:36:32, comprising 85961 updates in total.
In the RMART, the total time to perform the sequence of update operations decreases until three parallel tree threads, see Figure 8.6(a). In the MART, the total
time ascends. The performance gain is 23% in the case of two, 54% in the case of
three processes and 59% in the case of four processes.
The average time of an insert and delete operation rises with the number of
parallel threads in both trees, cf. Figures 8.6(b) and 8.6(c). In the MART, only
one process may update the tree at a time; all other processes cannot enter
the tree (the root is kept w-locked) and must wait until the update operation is
complete. Hence, the average time rises with the number of parallel threads.
In the RMART, many processes may update the tree concurrently at different
locations in the tree. The higher the number of parallel threads, the higher the
chance that update processes meet at the same location in the tree. Hence, the
average time rises with the number of parallel threads. The performance gain
when performing an update operation in the RMART compared to the MART is
considerable and increases with the number of parallel processes. For insertions,
it varies from 26% in the case of two processes to 61% in the case of four
processes. For deletions, it varies from 41% in the case of two processes to
59% in the case of four processes.
The faster a table update is settled, the faster a correct view of the current
network topology is established.
In the following simulations we perform IP lookup and table update operations
concurrently and examine the impact of varying update frequencies on the performance.
8.5.3 Various update frequencies
We simulate the update frequencies via the number of interspersed lookups in the
original update sequence (the original orders of updates and searches are maintained).
The first scenario performs 100k, the second 500k and the last scenario
1000k operations in total.
First scenario
In total, 100k operations are performed and hence 14039 lookups are interspersed.
In this scenario, the 85961 updates constitute 85.9% of the operations.
Only in the case of one process does the MART terminate earlier than the RMART.
In the MART, the lookups are performed concurrently, but due to the high update
rate, the total time ascends, cf. Figure 8.7(a). In the case of two parallel processes
the RMART needs 35% less time than the MART. This increases to 63% in the
case of four processes.
The performance gain when performing a search operation in the RMART compared to the MART increases with the number of processes. It varies from 22% in
the case of two processes to 27% in the case of four processes, cf. Figure 8.7(b).
This supports the hypothesis that in the MART, a search operation is delayed due
to instantaneous rebalancing. In relaxed balancing, the rebalancing operations are
postponed to less busy times and hence searches are not delayed as much.
The performance gain when performing an insert operation varies from 38% in the
case of two processes to 66% in the case of four processes, cf. Figure 8.7(c). When
performing a delete operation it constitutes 67% in the case of four processes, cf.
Figure 8.7(d).
Figure 8.7: Execution time to perform a sequence of 100k insert, delete and search
operations. The update rate is 85.9%.

(a) Total time [s]:
           2 processes   3 processes   4 processes
  RMART    2.60586       2.31174       2.30124
  MART     4.03869       5.70736       6.22572

(b) Average time per search operation [µs]:
           2 processes   3 processes   4 processes
  RMART    25.286        31.361        39.325
  MART     32.267        42.828        53.677

(c) Average time per insert operation [µs]:
           2 processes   3 processes   4 processes
  RMART    51.501        68.308        93.707
  MART     83.683        184.64        273.164

(d) Average time per delete operation [µs]:
           2 processes   3 processes   4 processes
  RMART    103.131       127.735       155.968
  MART     197.478       318.782       468.29
Second scenario
In this scenario, the updates constitute 17%. In total, 500k operations are performed
and hence 414039 lookups are interspersed.
In the MART, the lookups are performed concurrently. Due to the moderate update
rate, the total time levels off, cf. Figure 8.8(a). The performance gain amounts
to 31% in the case of four processes.
The average time per search operation is visualized in Figure 8.8(b). The performance
gain amounts to 14% in the case of four processes.
When performing an insert operation, it amounts to 51% in the case of four processes,
cf. Figure 8.8(c), and to 59% in the case of a delete operation, cf. Figure
8.8(d).
Third scenario
In the last scenario, 1000k operations are performed and hence 914039 lookups
are interspersed. The updates constitute 8.6%.
Figure 8.8: Execution time to perform a sequence of 500k insert, delete and search
operations. The update rate is 17%.

(a) Total time [s]:
           2 processes   3 processes   4 processes
  RMART    7.81801       6.05761       5.55424
  MART     8.1968        7.4579        8.07405

(b) Average time per search operation [µs]:
           2 processes   3 processes   4 processes
  RMART    23.931        27.264        32.236
  MART     24.136        29.154        37.543

(c) Average time per insert operation [µs]:
           2 processes   3 processes   4 processes
  RMART    54.15         64.855        87.555
  MART     61.085        105.119       177.382

(d) Average time per delete operation [µs]:
           2 processes   3 processes   4 processes
  RMART    106.881       121.662       130.528
  MART     114.067       211.601       314.951
Due to the relatively low update rate, the total time also decreases in the MART,
cf. Figure 8.9(a). The performance gain amounts to only 11% in the case of four
processes.
The average time per search operation is on par in both trees, cf. Figure 8.9(b).
The performance gain amounts to only 5% in the case of four processes.
When performing an insert operation, it amounts to 31% in the case of four processes,
cf. Figure 8.9(c). The performance gain when performing a delete operation amounts
to 36% in the case of three, and 41% in the case of four processes, cf. Figure
8.9(d).
8.5.4 Résumé of experimental results
In the experiments, scenarios with varying update frequencies were simulated.
If solely search operations are performed, the total time of the MART is on a
par with the total time of the RMART. If also insert and delete operations are
Figure 8.9: Execution time to perform a sequence of 1000k insert, delete and search
operations. The update rate is 8.6%.

(a) Total time [s]:
           2 processes   3 processes   4 processes
  RMART    14.73932      11.0807266    9.4920708
  MART     14.5188       11.4548       10.658857

(b) Average time per search operation [µs]:
           2 processes   3 processes   4 processes
  RMART    25.013        27.841        31.492
  MART     24.46         27.988        33.026

(c) Average time per insert operation [µs]:
           2 processes   3 processes   4 processes
  RMART    55.897        66.047        81.672
  MART     57.727        79.737        118.746

(d) Average time per delete operation [µs]:
           2 processes   3 processes   4 processes
  RMART    113.656       115.838       140.117
  MART     114.487       181.395       235.643
performed, then the higher the update/lookup ratio, the more clearly the relaxed
MART outperforms the standard MART. Table 8.1 summarizes the results for the
various tested scenarios when performing a sequence of search, insert and delete
operations.
Further, the results have shown that the performance gain per search operation
grows with the update/lookup ratio. If only search operations are performed, the
average time per search operation is on a par in both trees. The higher the ratio,
and the higher the number of processes, the clearer the difference in the average
performance per search operation. This confirms the hypothesis that in the relaxed
balanced min-augmented range tree, lookup queries are not delayed as much as in
the standard version, since the rebalancing operations are postponed.
The smaller the periods of lookup latency, the smaller the chance that packets are
dropped due to buffer overload. Another advantage is thus that the rate of packet
loss might be reduced.
The simulations were performed on a six-core uniform memory architecture. Here,
the total time of the RMART descended until a maximum of four threads. In the
                      number of processes
  % updates      2         3         4
     8.6         –         3        11
      17         5        19        31
      86        35        59        63
     100        23        54        59

Table 8.1: Performance gain (in percent) in terms of total execution time when
performing a sequence of search, insert and delete operations.
following, we evaluate the RMART's performance when executed on a Non-Uniform
Memory Architecture (NUMA). NUMA is a computer memory design used in
multiprocessors, where the memory access time depends on the memory location
relative to a processor. The next section summarizes our benchmark results obtained
on a Sun Fire X4600.
8.6 Benchmark on Sun Fire X4600
The Sun Fire X4600 M2 server is a NUMA system that supports up to eight internal CPU/memory modules. Each module holds a single dual-core AMD Opteron
processor, where each core has a dedicated 1 MB Level-2 cache. Further, each
module supports up to 8 DDR2 memory DIMM slots (1, 2, or 4 GB DIMMs).
Processors are directly connected to memory, I/O, and each other via Hypertransport links. Hypertransport technology is a high-speed, low-latency, point-to-point
link designed to increase the communication speed between integrated circuits in
computers, servers, embedded systems, and networking and telecommunications
equipment [85]. Figure 8.10 illustrates the Hypertransport topology for an eight-processor configuration.
With NUMA, maintaining cache coherence across shared memory has a significant overhead. By using inter-processor communication between cache controllers
the memory image is kept consistent when more than one cache stores the same
memory location. For this reason, cache-coherent NUMA performs poorly when
multiple processors attempt to access the same memory area in rapid succession.
In the following we summarize our benchmark results obtained on a Sun Fire
X4600 M2 server with an eight-processor configuration and compare them with our
results from the Sun Fire T2000 server. When solely lookups (one million) are performed,
there is a 48% advantage in the case of eight parallel processes compared
to the case of two processes in terms of total time on the T2000. When maintaining
a queue containing one million lookups on the X4600, both the RMART and the
MART do not scale with the number of processes, i.e., there is almost no advantage
Figure 8.10: Hypertransport topology in Sun Fire X4600 M2 servers for an eight-processor configuration. From [85].
when four processes perform the sequence of lookups compared to when only two
processes perform these operations. When the queue size is reduced and only 100k
lookups are performed, the total time of both trees scales well with the number
of processes. The total time of both trees descends until six processes with an
advantage of 39%.
When solely updates are performed on the T2000, the total time descends until three
threads in the RMART. There is a 17% advantage in the case of three processes
compared to the case of two executing RMART processes. When performed on
the X4600 server, the total time descends until five concurrent processes with an
advantage of 33%.
When updates and lookups are interleaved (100k operations in total), the total
time barely descends until four processes, with an advantage of 12%, when performed
on the T2000. When performed on the X4600, the total time descends until
five threads with an advantage of 32%. Here, the performance gain when performing
a search operation in the RMART compared to the MART constitutes 27% in
the case of four (five) processes when performed on the T2000 (X4600). Hence,
lookups are less delayed in the RMART on this architecture as well.
Even though the (R)MART is NUMA-unfriendly (the processes work on common
data), the RMART has been shown to scale quite well up to a certain number of
parallel processes on the X4600 server (provided that the sequence of operations,
on which all processes work concurrently, was not too large). Yet, not only the
underlying architecture plays a role in scalability, but also the implementation of the Java
Virtual Machine, particularly in our case the implementation of thread synchronization, targeted to the processor and operating system combination.
Whether performed on a uniform or a non-uniform memory architecture, the performance
of software running on a microprocessor is adversely affected by the instruction
cycle. There are basically four stages of an instruction cycle that a microprocessor
carries out:
1. Fetch the next instruction into the Current Instruction Register (CIR)
2. Decode the instruction
3. Execute the instruction
4. Store results back
The term “instruction cycle” refers to both the series of four steps and also the
amount of time that it takes to carry out the four steps. Most microprocessors
are typically divided into two main components: a datapath and a control unit
[86]. The first two steps of the instruction cycle are performed by the control
unit. The datapath mainly consists of an arithmetic logic unit (ALU) which is a
digital circuit that performs arithmetic and logical operations. A digital circuit is
often constructed from small electronic circuits called logic gates. Each logic gate
represents a function of boolean logic, e.g., AND and OR [87]. The inputs to the
ALU are the data to be operated on and a code from the control unit indicating
which operation to perform.
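The four stages can be illustrated by a toy software interpreter (the three-field instruction format and the opcodes are invented for illustration; they are not a real instruction set):

```java
public class ToyCpu {
    // A toy fetch-decode-execute-store loop over a two-operand
    // instruction format: {opcode, operandA, operandB}.
    static final int ADD = 0, AND = 1, OR = 2;

    static int run(int[][] program) {
        int acc = 0;                         // destination of "store results back"
        for (int pc = 0; pc < program.length; pc++) {
            int[] cir = program[pc];         // 1. fetch into the CIR
            int opcode = cir[0];             // 2. decode
            int a = cir[1], b = cir[2];
            int result;
            switch (opcode) {                // 3. execute (the ALU's job)
                case ADD: result = a + b; break;
                case AND: result = a & b; break;
                case OR:  result = a | b; break;
                default:  throw new IllegalArgumentException("bad opcode");
            }
            acc = result;                    // 4. store results back
        }
        return acc;
    }

    public static void main(String[] args) {
        int[][] program = { {ADD, 2, 3}, {AND, 6, 3}, {OR, 4, 1} };
        System.out.println(run(program)); // 5
    }
}
```

A dedicated hardware implementation of one fixed algorithm removes stages 1 and 2 entirely, which is exactly the saving discussed next.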
Implementing the software algorithm directly in hardware, i.e., a dedicated datapath and control unit that executes only a particular algorithm, can alleviate the
performance penalty of the fetch-decode steps.
A Field-Programmable Gate Array (FPGA) is an integrated circuit that contains
programmable logic components called “logic blocks”, and programmable interconnects. Logic blocks can be programmed to perform the function of basic logic
gates such as AND and NOT [88] [89]. Complex designs are created by combining
these basic blocks to create the desired circuit. To configure an FPGA the user
specifies the FPGA’s function with a logic circuit diagram or a hardware description language (HDL). This specification is fed to a software suite from the FPGA
vendor that produces a file which is then transferred to the FPGA [88].
Packet forwarding in high-speed IP routers must be done directly in hardware.
In the next section we will describe how the RMART could be described by a
hardware description language.
8.7 Implementing the RMART in hardware
Field-Programmable Gate Arrays offer the flexibility of software executed on a
microprocessor along with increased performance in terms of throughput. However,
this flexibility usually requires expertise in hardware design and a hardware
description language such as VHDL or Verilog. The High-Performance FPGA
Laboratory (HPFL) at Oakland University has developed a compiler that is able
to convert programs written in Java to datapaths and corresponding control units,
collectively called flowpaths [90]. Flowpaths can then be synthesized to hardware
using an HDL. The compiler takes as input Java byte codes generated from
a Sun Microsystems compliant Java compiler. Java is one of several software
programming languages that compile to a stack-based intermediate representation (IR).
To execute instructions, variables are loaded onto the stack, the instruction
is executed, and the result is stored back. This load-execute-store feature
of Java byte codes maintains the normal microprocessor paradigm. Indeed,
microprocessors have been developed that execute Java byte codes directly [91].
However, the performance of these microprocessors still suffers from the excessive
use of local variables in the original software program [92].
Experimental results show that flowpaths can perform within a factor of two
of a minimal hand-crafted direct hardware implementation and orders of magnitude
better than compiling the program to a microprocessor [93]. Duchene and
Hanna further describe a technique to extend the flowpath architecture to generate
flowpaths directly from Java byte codes representing multithreaded Java
programs [94]. Java supports thread synchronization through the use of monitors.
When a thread holds the monitor for some object, other threads are locked out
and cannot inspect or modify the object. Java uses the synchronized keyword,
see section 6, to mark sections that operate on common data. In the Java language,
a unique monitor is associated with every object that has a synchronized
method. The synchronized keyword is reflected by monitorenter/monitorexit in
the corresponding Java byte codes. Given a multithreaded algorithm, each thread
is converted to a flowpath, where the logic for requesting and releasing locks to
shared memory is added to each flowpath. Further, each flowpath is connected to
an access controller which controls access to shared memory, cf. Figure 8.11.
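For illustration, the following class (hypothetical, not taken from the thesis code) guards a shared counter with a synchronized block; compiling it and inspecting the byte codes with javap -c shows the monitorenter/monitorexit pair that such a compiler translates into lock request/release logic:

```java
public class SharedCounter {
    private final Object lock = new Object();
    private int value = 0;

    // The synchronized block below compiles to a monitorenter /
    // monitorexit pair in the byte codes; only one thread at a time
    // may hold the monitor of `lock`.
    public int increment() {
        synchronized (lock) {
            return ++value;
        }
    }

    public int get() {
        synchronized (lock) {
            return value;
        }
    }
}
```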
Space for the instance variables of an object is allocated when the application
instantiates an object instance from a class (with new) [92]. A traditional producer/consumer example to create flowpaths from a multithreaded Java application can be found in [94].
Performance increases occur, in general, since flowpaths created from multithreaded
Java programs do not suffer from traditional processor bottlenecks such as context
switching, stack manipulation and the traditional instruction cycle [92].
This extended scheme could be employed in order to implement the RMART
directly in hardware. The tasks are the threads that operate on the tree and are
contained in a single task frame. Further research is required to investigate how
the RMART performs when implemented directly in hardware. A current Xilinx
FPGA operates at over 550 MHz, contains a total of over 10 megabits of embedded
memory, and provides high-bandwidth interfaces to several megabytes of off-chip
memory.

Figure 8.11: N tasks with shared data. From [94].
Chapter 9
Conclusions and Future Directions
In order to efficiently support update bursts and to reduce IP lookup latency, we
proposed an elegant representation of dynamic routing tables, namely relaxed
balanced min-augmented range trees. The interactive animation of the RMART
enhanced the understanding of its concepts and furnished an additional indication
that the proposed locking scheme is deadlock-free.
We have benchmarked the relaxed min-augmented range tree versus the strictly
balanced version using real IPv4 routing data. The experimental results confirmed
the hypothesis that the relaxed balanced min-augmented range tree is better suited
than its strictly balanced counterpart when confronted with update bursts. Of course,
the simulation environment is not a practical solution for high-speed IP routers.
Rather, packet forwarding must be done in hardware. It would be interesting to
evaluate RMART’s performance when implemented directly in hardware. This
could be achieved by utilizing the extended flowpath architecture as outlined in
section 8.7.
The change of paradigm in IP networks towards QoS, multimedia and real-time
applications calls for fast update rates in higher dimensional classification tables.
Typically, high update rates in packet classification designs have been sacrificed to
achieve better search times or to reduce storage capacity requirements. By virtue
of our experimental results, it would be interesting to study relaxed data structures
which can be used for the representation of higher dimensional packet classifiers.
It is generally accepted that routers will take longer to forward IPv6 packets,
and that the routing tables under IPv6 will get bigger [95] [96]. Further, it is
reasonable to expect that, as the number of hosts connected to the Internet grows,
worst case burst update rates will increase. Further research is needed to scrutinize
the adequacy of the RMART under IPv6.
Part II
Packet Classification
Chapter 10
Introduction
In order to enforce network security policies, guarantee service agreements, perform monitoring etc., network routers are required to examine multiple fields of
the packet header. Geometrically speaking, classifying an arriving packet is equivalent to finding the highest priority rectangle among all rectangles that contain
the point representing the packet. The R-tree, one of the most popular access
methods for multidimensional data, was introduced by Guttman in 1984 [97]. The
R-tree was proposed as an index mechanism that stores d-dimensional geometric
objects and supports spatial retrievals efficiently. The challenge for R-trees is the
following: dynamically maintain the structure in a way that retrieval operations
are supported efficiently. Common retrieval operations are range queries, i.e., find
all objects that a query region intersects, or point queries, i.e., find all objects that
contain a query point.
The R-tree is easy to implement, which contributes considerably to its popularity.
Several modifications of the original R-tree have been proposed to either improve
its performance or adapt the structure to a different application domain.
10.1 Goal of this part
The R-tree and its variants, being amongst the most popular access methods for
points and rectangles, have not been experimentally evaluated and benchmarked
for their suitability for the packet classification problem. In this chapter we investigate
how well the popular R*-tree is suited for five-dimensional packet classification
in a static environment. To this end we will benchmark the R*-tree against two
representative classification algorithms using the ClassBench tools suite [98]. If the
R*-tree proves to be suitable in a static classification scenario, then it can further
be investigated in a dynamic scenario, i.e., where classification is intermixed with
filter updates.
Since spatial databases often involve massive datasets, R-trees and their variants
are often implemented disk-based. Packet classification has to be done as fast as
possible; hence, in our benchmark, we will use a main-memory-based implementation.
10.2 Organization of part II
The remainder of this part is organized as follows. Section 10.3 surveys packet
classification techniques. Chapter 11 describes the R-tree and several of its variants.
After presenting the classification algorithm based on R-trees in chapter 12, we
discuss our benchmark results for the R*-tree and two representative classification
algorithms.
10.3 Related work
The RFC (Recursive Flow Classification) scheme is a decomposition-based algorithm, which provides very high classification throughput at the cost of low memory efficiency [99]. RFC performs independent, parallel searches on chunks of the
packet header. Thus, the parallelism offered by hardware can be leveraged. The
result of each chunk lookup is an equivalence class identifier eqID, that represents
the set of potentially matching filters for the packet. An example of assigning
eqIDs is shown in Figure 10.1. In this example, the rectangles are defined by the
filters in our running example filter set in Figure 1.3. The end points of each
rectangle are projected to the axis. Any two adjacent projection points on an
axis define an elementary interval which is fully covered by a set of filters. Two
neighboring elementary intervals cannot represent the same filters; whereas two
nonadjacent elementary intervals are possible to represent the same filters. Each
elementary interval is assigned an eqID. The elementary intervals representing the
same set of filters are labeled with the same eqID. Note that in our example the
fields create six equivalence classes in the source address field and five equivalence
classes in the destination address field.
The results of the chunk searches are combined in multiple phases. RFC lookups
in chunk and aggregation tables utilize indexing. The index tables used for aggregation require significant precomputation in order to assign the proper eqIDs for
the combination of the eqIDs of the previous phases. Such extensive precomputation precludes dynamic updates at high rates.
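The elementary-interval construction described above can be sketched in a few lines of Python (our own illustration, not code from the RFC paper; the function name and the half-open interval convention are our assumptions):

```python
# Sketch: assigning eqIDs to the elementary intervals induced by a set of
# one-dimensional ranges, as in a single RFC chunk lookup. Ranges are
# half-open [lo, hi) integer intervals.

def assign_eqids(ranges):
    """Return (boundaries, eqids) where eqids[i] is the equivalence-class
    identifier of the elementary interval [boundaries[i], boundaries[i+1])."""
    points = sorted({p for lo, hi in ranges for p in (lo, hi)})
    classes = {}          # frozenset of covering ranges -> eqID
    eqids = []
    for lo, hi in zip(points, points[1:]):
        covering = frozenset(i for i, (a, b) in enumerate(ranges)
                             if a <= lo and hi <= b)
        # identical filter sets (possibly non-adjacent) share one eqID
        eqids.append(classes.setdefault(covering, len(classes)))
    return points, eqids

# Example: two overlapping ranges create three elementary intervals.
bounds, ids = assign_eqids([(0, 10), (5, 20)])
print(bounds)  # [0, 5, 10, 20]
print(ids)     # [0, 1, 2]
```

Note that adjacent intervals always receive different eqIDs, which matches the observation above that two neighboring elementary intervals cannot represent the same set of filters.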
Several proposed techniques employ a trie-based approach. A hierarchical trie (H-trie) [100] is a multidimensional prefix-based matching scheme, i.e., range-based
fields must be transformed into prefixes first. An H-trie is recursively constructed as
follows: First, a one-dimensional trie is constructed for the first dimension. Then,
for each prefix pr, a (d − 1)-dimensional trie is constructed on those filters that
specify pr in the first dimension. Each node in the first trie that is associated with
a prefix is connected to the second trie, and so forth. Classification of an incoming
packet starts in the top trie. At each trie node encountered, the algorithm follows the “next-trie” pointer (if present) and traverses the (d − 1)-dimensional trie.

Figure 10.1: Example of Recursive Flow Classification using the filter set in Figure 1.3.
Assuming that the maximum prefix length is w and the number of dimensions
is d, the H-trie requires O(w^d) search time and O(ndw) memory, where n is the
number of filters. Incremental updates can be carried out in O(d^2 w) time, since
each component of the updated rule is stored in exactly one location at maximum
depth O(dw).
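The construction and lookup just described can be illustrated with a small Python sketch (our own simplification over bit-string prefixes; class and function names are hypothetical). Note how the lookup must descend into every next-trie found along the first-dimension path, which is the source of the O(w^d) search time:

```python
# Illustrative sketch of a two-dimensional hierarchical trie. Each node of
# the first-dimension trie that ends a prefix carries a "next-trie" over the
# remaining dimensions; rules of the last dimension are stored at nodes.

class TrieNode:
    def __init__(self):
        self.children = {}      # bit ('0' or '1') -> TrieNode
        self.next_trie = None   # trie over the remaining dimensions
        self.rules = []         # rules stored here (last dimension only)

def insert(root, prefixes, rule):
    node = root
    for bit in prefixes[0]:
        node = node.children.setdefault(bit, TrieNode())
    if len(prefixes) == 1:
        node.rules.append(rule)
    else:
        if node.next_trie is None:
            node.next_trie = TrieNode()
        insert(node.next_trie, prefixes[1:], rule)

def lookup(root, keys):
    matches, node = [], root
    for bit in keys[0] + "$":           # sentinel: also check the last node
        if len(keys) == 1:
            matches += node.rules
        elif node.next_trie is not None:
            matches += lookup(node.next_trie, keys[1:])   # descend here too
        if bit == "$" or bit not in node.children:
            break
        node = node.children[bit]
    return matches

# Example: rule "r1" = (0*, 1*); a packet with header bits (00, 11) matches.
root = TrieNode()
insert(root, ["0", "1"], "r1")
print(lookup(root, ["00", "11"]))  # ['r1']
```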
Set-pruning tries improve the search time to O(dw) by replicating rules to eliminate the need for multiple traversals in each of the tries [101] [100]. The query for
an incoming packet with fields (h1 , h2 , . . . , hd ) locates the node associated with the
longest matching prefix for the first field h1 , then follows the “next-trie” pointer
to locate the longest matching prefix for h2 and so forth for all dimensions. The
rules are replicated to ensure that every matching rule will be encountered in the
path. The query time is reduced to O(dw), yet this comes at the price of
O(n^d dw) memory. Update complexity is O(n^d), and hence,
this data structure is only suited for relatively static classifiers.
The Grid-of-Tries data structure [102] was proposed for two-dimensional packet
classification and eliminates filter replication by storing filters at a single node and
using switch pointers to direct searches to potentially matching filters. Grid-of-Tries bounds memory usage to O(nw) while achieving a search time of O(w). The
authors propose a technique using multiple instances of the Grid-of-Tries structure
for packet classification on the standard 5-tuple, albeit with some loss of efficiency.
Baboescu, Singh, and Varghese proposed Extended Grid-of-Tries (EGT), which supports multiple-field searches without the need for many instances of the Grid-of-Tries structure [103]. In the worst case, EGT requires O(w^2) memory accesses per
classification.
In the following, we will survey several approaches that utilize the geometric view
of the filter set.
Gupta and McKeown introduced a seminal technique called Hierarchical Intelligent
Cuttings (HiCuts) [104]. The concept of cutting comes from viewing the packet
classification problem geometrically. Selecting a decision criterion is analogous to
choosing a partitioning, or cutting, of the space. The decision-tree construction
algorithm recursively cuts the space into smaller sub-regions, one dimension per
step. The cuttings are made by axis-parallel hyperplanes. In order to keep the
decisions at each node simple, the region covered by a node is cut into equally sized partitions along
a single dimension. The leaves contain a small number of filters bounded by a
threshold. A larger threshold can help reduce the size and depth of a decision tree,
but can yield a longer linear search time. A smaller threshold has the opposite
effects.
Packet header fields are used to traverse the decision tree until a leaf is reached.
The filters stored in that leaf are then linearly searched for a match. If a packet
matches multiple filters, the one with the highest priority is returned.
Figure 10.2 illustrates an example of the decision-tree construction for our example
filter set in Figure 1.3. First, we cut along the x-axis to generate four sub-regions.
If we decide it is affordable to do a linear search on at most three filters, we can
stop cutting sub-regions with three or fewer filters. Three of the four sub-regions
contain three or fewer filters, hence we can stop cutting these regions further. At
the following step, we choose the remaining sub-region to cut along the y-axis to
generate two sub-regions. This results in a sub-region containing only two filters
(f2 , f7 ), and another sub-region containing four filters (f2 , f3 , f4 , f7 ). At the last
step, we cut along the y-axis to generate two sub-regions, each containing three
rules. Now, every sub-region contains at most three rules and the
construction terminates.
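The construction sketched above can be illustrated as follows (a minimal sketch, not the authors' implementation: it always cuts into a fixed number of equal parts and rotates the cut dimension round-robin, whereas the real HiCuts heuristics choose both per node):

```python
# HiCuts-style decision tree: recursively cut one dimension into equal
# parts until each leaf holds at most `binth` filters. Filters and regions
# are axis-aligned boxes given as per-dimension (lo, hi) integer ranges;
# the region extent is assumed divisible by `cuts`. A real implementation
# also caps the recursion depth to avoid degenerate cases.

def overlaps(f, region):
    return all(flo <= rhi and rlo <= fhi
               for (flo, fhi), (rlo, rhi) in zip(f, region))

def build(filters, region, binth=3, cuts=4, dim=0):
    if len(filters) <= binth:
        return ("leaf", filters)
    lo, hi = region[dim]
    width = (hi - lo + 1) // cuts
    children = []
    for i in range(cuts):
        sub = list(region)
        sub[dim] = (lo + i * width, lo + (i + 1) * width - 1)
        inside = [f for f in filters if overlaps(f, sub)]
        children.append(build(inside, sub, binth, cuts,
                              (dim + 1) % len(region)))
    return ("node", dim, children)

def classify(tree, packet, region, cuts=4):
    while tree[0] == "node":
        _, dim, children = tree
        lo, hi = region[dim]
        width = (hi - lo + 1) // cuts
        i = min((packet[dim] - lo) // width, cuts - 1)
        sub = list(region)
        sub[dim] = (lo + i * width, lo + (i + 1) * width - 1)
        tree, region = children[i], sub
    # final step: linear search over the short leaf list
    return [f for f in tree[1]
            if all(a <= p <= b for p, (a, b) in zip(packet, f))]
```

For instance, with two filters f1 = ((0, 3), (0, 15)) and f2 = ((4, 15), (0, 3)) over the region ((0, 15), (0, 15)) and a threshold of one filter per leaf, the first cut along x already separates the two, and classifying the packet (2, 5) reaches a leaf holding only f1.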
The resulting data structure is shown in Figure 10.3. Each tree node covers a
portion of the d-dimensional space and the root node covers the entire space. In
this example, we have set the thresholds such that a leaf contains at most three
filters and a node may contain at most four children.
It is very difficult to find the globally optimal decision tree under given constraints.
In practice, the algorithm therefore uses various heuristics to select a decision criterion at
each node that minimizes the depth of the tree while controlling the amount of
memory used. Intuitively, the more cuts are made at each step, the wider and
shallower the resulting decision tree will be. However, a large number of cuts may lead
to an excessive duplication of filters. Apart from the number of cuts, the choice of
the cutting at each intermediate decision tree node is also critical for the algorithm's
performance.

Figure 10.2: A partitioning created by HiCuts for the example filter set in Figure 1.3.

Figure 10.3: HiCuts data structure for the example filter set in Table 1.1. The maximum size of the set of filters at each leaf is set to three.
The preprocessing time is high, caused mainly by the complexity of the heuristics.
Incremental update time depends on the filter to be inserted or deleted.
The HyperCuts algorithm introduced in [105] eliminates a limitation of HiCuts
by using the most representative dimensions, as opposed to only a single dimension, to cut the space. This simulates several cuts of HiCuts in one cut. This
approach reduces the height of the decision tree. For each of the chosen dimensions, the number of cuts is computed based on a metric dependent on the amount
of space that is available for the search structure.
Another optimization is the idea of pulling filters up in the decision tree. The
authors observed that a heavily wildcarded filter often ends up in many leaves,
increasing storage consumption. Their approach pulls all filters common to a subtree up into a linear list at the root of the subtree.
In order to determine which of the pointers in each node to follow during a search,
array indexing, which costs one memory access regardless of the number of children at a node, is used. In this indexing scheme, cut widths have to be fixed in
each dimension.
Qi and Li [106] propose ExCuts, an extension of HyperCuts that mainly reduces memory consumption.
Another geometry-based solution for multidimensional classification referred to as
G-filter was proposed by Geraci et al. [107]. The space that represents all possible
values of the packets’ attributes is called the universe. The input for the algorithm
that constructs the search data structure is a region r of the search space, and a list
F (r) of filters potentially intersecting the region r. Initially, the algorithm starts
with the entire filter set F and r equals the universe. The algorithm partitions the
filters into the following sets, with each filter f belonging to exactly one set:
1. if f does not intersect r, it is discarded (a query point in region r will never
match the rule);
2. otherwise, if f covers the entire region r, it becomes part of the set cover (r)
of cover rules;
3. otherwise, if the projection Pj (f ) of f on axis j entirely covers the projection
Pj (r) of the region r on the same axis, f becomes part of the set F Bj (r) of
fallback rules on axis j (if f satisfies this property for more than one axis,
we arbitrarily pick one);
4. otherwise, filter f becomes part of the set cross(r) of cross rules: it
intersects r but does not fall into any of the other categories.
Figure 10.4: For the universe u, f7 ∈ cover(u); f1, f2 ∈ FB2(u); f3, f4, f5, f6 ∈
cross(u). For the subregion y2, f5, f6 ∈ FB1(y2). For the subregion y4, f3, f4 ∈
cross(y4).
Figure 10.4 shows a two-dimensional example of the relation between rules and
regions. Any packet p contained in a region r matches all rules in cover (r). The
only information we need to remember from this set is the filter fh (r) with the
highest priority in cover (r), as this will be a potential result for the classification.
For fallback rules, we know that if p ∈ r, then the j-th coordinate of p is within the
range Pj (f ) of all the rules in the set F Bj (r). So p will match a rule f ∈ F Bj (r)
if and only if its remaining (d − 1) coordinates are contained in the remaining
(d − 1) ranges of the rule. Thus the problem reduces to a classification problem in a
(d − 1)-dimensional region.
Cross rules have to be partitioned further. This is done by recursively partitioning region r into m regions y1 . . . ym of uniform size and shape and assigning the
remaining cross rules to these subregions. Figure 10.5 shows the G-filter for the
example of Figure 10.4 for the first two levels and m = 4. The only cover rule at
root level is filter f7 . Hence it is remembered as fh (r). The cross rules f3 , f4 , f5
and f6 are located in the subregions y2 and y4 . Hence, subregions y1 and y3 point
to null. Rules f5 and f6 are assigned to F B1 (y2 ). Filters f3 and f4 are assigned to
cross(y4 ) and hence subregion y4 is partitioned further.
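The four partition rules above can be stated compactly in code (our own sketch; a filter and a region are tuples of per-axis (lo, hi) intervals, and the function name is hypothetical):

```python
# G-filter partition step: decide which set a filter f falls into relative
# to a region r. Returns a tag and, for fallback rules, the chosen axis j.

def partition(f, r):
    if any(fhi < rlo or rhi < flo
           for (flo, fhi), (rlo, rhi) in zip(f, r)):
        return ("discard", None)                 # rule 1: no intersection
    if all(flo <= rlo and rhi <= fhi
           for (flo, fhi), (rlo, rhi) in zip(f, r)):
        return ("cover", None)                   # rule 2: covers all of r
    for j, ((flo, fhi), (rlo, rhi)) in enumerate(zip(f, r)):
        if flo <= rlo and rhi <= fhi:
            return ("fallback", j)               # rule 3: first such axis
    return ("cross", None)                       # rule 4: everything else

# Example in 2-D for a region r = ([0, 7], [0, 7]):
r = ((0, 7), (0, 7))
print(partition(((0, 9), (0, 9)), r))   # ('cover', None)
print(partition(((0, 9), (2, 5)), r))   # ('fallback', 0)
print(partition(((2, 5), (2, 5)), r))   # ('cross', None)
```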
The classification can be performed as a recursive process on the data structure. At each node (initially the root), we perform d recursive queries on the
(d − 1)-dimensional fallback structures and one recursive query on the subregion yi with p ∈ yi ,
and return the highest priority rule among fh (r) and the rules returned by the
(d + 1) recursive queries. As an example, consider the packet p with coordinates
(900, 900). The highest priority filter fh (r) is f7 . The highest priority rule returned
by the fallback structures is f1 . The subregion y1 is not further partitioned. Hence
the query algorithm yields f1 as the highest priority rule.
Figure 10.5: The G-filter for the example of Figure 10.4 for the first two levels and m = 4.

Let F be a set of n hyperrectangles in a d-dimensional universe u and k a parameter, 1 ≤ k ≤ n. The data structure uses O(n k^{f(d)} (log_k |u|)^d) space and performs
packet classification in time O((log_k |u|)^d). The function f(d) grows roughly as d^2/2.
The structure does not support incremental updates, i.e., the structure has to be
reconstructed from scratch each time the classifier changes.
The Area-based Quadtree (AQT) was proposed by Buddhikot et al. [108] for two-dimensional classification on the source and the destination prefix fields. The
search space is recursively partitioned into four equally sized quadrants. Each rectangular search space is mapped into a node in a quadtree. In other words, the
entire space is mapped into the root node of the quadtree, and four equally sized
quadrants are mapped into four children of the root node, and so forth.
Rules are allocated to each node as follows. A filter is said to cross a quadrant if
it completely spans at least one dimension of the quadrant. The authors call the
set of all filters that cross a given region r its Crossing Filter Set (CFS). The
CFS of a region r can be split into two sets CX(r) and CY (r). The former is the
set of filters that cross r perpendicular to the x-axis. The set CY (r) is the set of
filters that cross r perpendicular to the y-axis. In Figure 1.3, f1 and f2 belong to
CX(2^10 × 2^10). For both sets, only the range specified in the other dimension has
to be stored.
Each filter f is stored exactly once, at the highest node for which f is a crossing
filter.
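The crossing test and the CX/CY split might be sketched as follows (a hypothetical helper of our own; filters are pairs of per-axis (lo, hi) ranges):

```python
# AQT crossing test: a filter crosses a quadrant if it completely spans the
# quadrant in at least one dimension. CX collects the filters spanning the
# x-extent, CY those spanning the y-extent; as noted above, only the range
# in the other dimension needs to be stored.

def crossing_sets(filters, quad):
    (qx_lo, qx_hi), (qy_lo, qy_hi) = quad
    cx = [fy for (fx, fy) in filters
          if fx[0] <= qx_lo and qx_hi <= fx[1]]   # spans x: keep y-range
    cy = [fx for (fx, fy) in filters
          if fy[0] <= qy_lo and qy_hi <= fy[1]]   # spans y: keep x-range
    return cx, cy

# Example: f1 spans the whole x-extent of the quadrant, f2 spans neither.
quad = ((0, 7), (0, 7))
f1 = ((0, 15), (2, 5))
f2 = ((1, 6), (1, 6))
print(crossing_sets([f1, f2], quad))  # ([(2, 5)], [])
```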
At each node, we search the CFS structure for the highest priority filter at that
node (two one-dimensional lookups). If the filter has higher priority than the filter
recorded so far, we replace that filter and continue.
The AQT can take advantage of a well known technique called fractional cascading
to reduce the O(h log w) worst case search complexity to O(h + log w), where the
worst case height h is w (w is the maximum prefix length). The memory requirement is O(n), because each filter is stored exactly once.
AQT supports incremental updates, allowing the update complexity to be traded off
against the query time via a tunable parameter.
Lim, Kang and Yim propose a priority-based quadtree (PQT) for two-dimensional
packet classification [109]. By additionally utilizing the priority of rules, the number of tree levels can be reduced.
A survey of packet classification techniques can be found in [38] and [100].
We have seen that a number of solutions perform a linear search over a bounded
subset of filters as the final step, e.g., [104], [110], [105]. If the lists are kept short,
this typically results in large savings in terms of data structure space, at only a
small cost in terms of classification performance. We will propose a classification
scheme based on R-trees that also follows this approach.
Currently, most vendors maintain their packet filters in Ternary Content Addressable Memories (TCAMs) [39]. For example, the Cisco Catalyst 6500 Series Switch
and Cisco 7600 Router maintain QoS and security policies in TCAMs which are
accessed by application-specific integrated circuits (ASICs) [111] [112] [113]. However, the usefulness of TCAMs is limited by their high power consumption, inefficient representation of range-match fields, and lack of flexibility and programmability. Juniper Networks uses an ASIC-driven memory approach with more conventional memory architectures such as Static RAM (SRAM) and Reduced Latency
Dynamic RAM (RLDRAM), together with sophisticated data structures for packet classification [39].1
Although many algorithms and architectures have been proposed, the design of
efficient packet classification systems remains a challenging problem. In the following we will describe the R-tree and how it supports packet classification.
1 According to Juniper Networks, the 50 Gbps T-series Packet Forwarding Engine (PFE) is currently the newest, most flexible and highest performing PFE on the market [39].
Chapter 11
R-trees
11.1 The original R-tree
R-trees are hierarchical data structures based on B+-trees [114] and were introduced as an index for multidimensional information. The R-tree abstracts an
object o by using its minimum bounding d-dimensional rectangle (MBR). The
leaves of the tree contain the MBRs of the objects as well as pointers to these
objects. Each non-leaf node of the R-tree contains entries which store a pointer
to a child node and the MBR that bounds all rectangles in that child node. An
example of eight objects (o7 − o14) and a possible organization of these objects
using six MBRs (R1 − R6) is shown in Figure 11.1. A corresponding R-tree is
visualized in Figure 11.2. The space is split by hierarchically nested, and possibly
overlapping minimum bounding rectangles. Hence, an object may be contained in
several MBRs, but it is associated with only one R-tree node. For example, object
o8 is contained in R3 and R4, but is only stored in the leaf pointed to by R3.
Let m be the minimum and M the maximum allowed number of entries that each
node can store and 2 ≤ m ≤ M/2. The R-tree of order (m, M ) has the following
characteristics:
• Each leaf node entry is of the form (mbr; oid), such that mbr is the MBR
that spatially contains the object and oid is the object’s identifier.
• Each entry in an internal node is of the form (mbr; p), where p is a pointer to
a child of the node and mbr is the MBR that spatially contains the rectangles
in this child.
• The minimum allowed number of entries in the root node is two, unless it is
a leaf. In this case, it may contain zero or a single entry.
• All leaves of the R-tree are at the same level.
Figure 11.1: An example of eight objects (o7 − o14) and a possible organization
using six MBRs (R1 − R6).
Figure 11.2: A corresponding R-tree.
Let n be the number of objects. The maximum value for the height h is [115]:
h_max = ⌈log_m n⌉ − 1.
The maximum number of nodes can be derived by summing the maximum possible number of nodes per level. This maximum is attained when all nodes contain
the minimum allowed number of entries, i.e., m. Therefore, the
maximum number of nodes in an R-tree equals
∑_{i=0}^{h} m^i.
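As a small numeric check of these two bounds (our own example):

```python
# For n = 1000 objects and m = 4 minimum entries per node:
# h_max = ceil(log_m n) - 1, and the node count is sum_{i=0}^{h} m^i
# when every node holds only the minimum m entries.
import math

n, m = 1000, 4
h_max = math.ceil(math.log(n, m)) - 1
max_nodes = sum(m**i for i in range(h_max + 1))
print(h_max)      # 4, since 4^4 = 256 < 1000 <= 4^5 = 1024
print(max_nodes)  # 1 + 4 + 16 + 64 + 256 = 341
```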
11.1.1 Query processing
The processing of a range (point) query commences from the root node of the
tree. For each entry whose MBR intersects (contains) the query region (point),
the process descends to the corresponding subtree. At the leaf level, for each
bounding rectangle that intersects (contains) the query region, the corresponding
object is examined. The algorithm that processes point queries in an R-tree is
given in Algorithm 6. For a node entry e, e.mbr denotes the corresponding MBR
and e.p the corresponding pointer to the next level. If the node is a leaf, then e.p
denotes the corresponding object identifier (oid).
Algorithm 6 Point Query
1: procedure PointQuery(TypeNode N, TypePoint Q)
2:     if N is not a leaf node then
3:         examine each entry e of N to find those e.mbr that contain Q
4:         for each such entry e call PointQuery(e.p, Q)
5:     else
6:         examine all entries e and find those for which e.mbr contains Q
7:         add these entries to the answer set
8:     end if
9: end procedure
The MBRs may overlap each other. Thus it cannot be guaranteed that only one
search path is traversed during a point query. As we will see in subsection 12.1.2,
the test whether an MBR contains a query point can be implemented efficiently. This
is of practical interest for our benchmark, since this test determines the cost of
a point query and hence of packet classification.
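A runnable rendering of Algorithm 6 might look as follows (a sketch under our own assumptions about the node layout: a node is a ("leaf", entries) or ("inner", entries) pair, and an MBR is a tuple of per-dimension (lo, hi) intervals):

```python
# Point query on an R-tree. Because MBRs may overlap, several entries of a
# node can contain the query point, so more than one path may be taken.

def contains(mbr, q):
    return all(lo <= x <= hi for x, (lo, hi) in zip(q, mbr))

def point_query(node, q, answers=None):
    if answers is None:
        answers = []
    kind, entries = node
    for mbr, payload in entries:
        if contains(mbr, q):
            if kind == "inner":
                point_query(payload, q, answers)  # descend into the child
            else:
                answers.append(payload)           # payload is the oid
    return answers

# Example: two leaves under one root; the query point hits both leaf MBRs.
leaf1 = ("leaf", [(((0, 4), (0, 4)), "o7")])
leaf2 = ("leaf", [(((2, 8), (2, 8)), "o8")])
root = ("inner", [(((0, 4), (0, 4)), leaf1), (((2, 8), (2, 8)), leaf2)])
print(point_query(root, (3, 3)))  # ['o7', 'o8']
```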
11.1.2 Query optimization criteria
In the following, some of the parameters that are essential for retrieval performance are considered [116].
• The area covered by an MBR should be minimized, i.e., the dead space
(the area covered by the bounding rectangle but not covered by the enclosed
rectangles) should be minimized. This will improve performance, since decisions
about which paths have to be traversed can be taken at higher levels.
• The overlap between MBRs should be minimized. This also decreases the
number of paths to be traversed.
• The perimeter of an MBR should be minimized. For a fixed area, the
rectangle with the smallest perimeter is the square. Thus, by minimizing the perimeter instead of the area, the MBRs become more square-shaped. Essentially,
queries with large square-shaped query rectangles will profit from this optimization.
• Storage utilization should be optimized. Higher storage utilization will generally reduce the query cost as the height of the tree will be kept low.
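The geometric quantities behind these criteria are straightforward to compute for boxes given as per-axis (lo, hi) intervals (the helper names are our own):

```python
# Area, margin (total side length, i.e. half the "perimeter" measure used
# for d-dimensional boxes) and pairwise overlap of two MBRs.

def area(mbr):
    a = 1
    for lo, hi in mbr:
        a *= hi - lo
    return a

def margin(mbr):
    return 2 * sum(hi - lo for lo, hi in mbr)

def overlap(m1, m2):            # area of the intersection, 0 if disjoint
    a = 1
    for (lo1, hi1), (lo2, hi2) in zip(m1, m2):
        side = min(hi1, hi2) - max(lo1, lo2)
        if side <= 0:
            return 0
        a *= side
    return a

m1, m2 = ((0, 4), (0, 4)), ((2, 8), (2, 8))
print(area(m1), margin(m1), overlap(m1, m2))  # 16 16 4
```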
11.1.3 Updates
The R-tree is a dynamic structure. Thus all approaches to optimizing the retrieval
performance have to be applied during the insertion of a new object or the deletion of an existing one.
The insertion algorithm calls two more algorithms in which the crucial decisions
for good retrieval performance are made. The first is the algorithm ChooseSubtree.
Beginning at the root, descending to a leaf, it finds on every level the most suitable
subtree to accommodate the new entry. The second is the algorithm Split. It is
called, if ChooseSubtree ends in a node filled with the maximum number of entries
M . Split should distribute (M +1) rectangles into two nodes in a way that makes it
as unlikely as possible that both new nodes will need to be examined on subsequent
searches.
11.2 R-tree variants
In all R-tree variants that have appeared in the literature, tree traversals for any
kind of operations are executed in exactly the same way as in the original R-tree.
Basically, the variations of R-trees differ in how they choose the appropriate subtree
and how they perform splits during insertion by considering different minimization
criteria [115].
Insertions of new objects are directed to leaf nodes. At each level, the most suitable
subtree to accommodate the new entry has to be chosen. In the original R-tree
as proposed by Guttman, this is the node that needs least area enlargement to
include the new object. Ties are resolved by choosing the entry with the rectangle
of the smallest area. Finally the object is inserted in an existing leaf if there is
adequate space, otherwise a split takes place. Since the decision whether to visit a
node depends on whether its covering rectangle overlaps the search area, the total
area of the two covering rectangles after a split should be minimized. Guttman
discusses split-algorithms with exponential, quadratic and linear cost with respect
to the number of entries in a node. All of them are designed to minimize the area
covered by the two rectangles resulting from the split.
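Guttman's subtree choice at a single level can be sketched as follows (our own helper names; entries are (MBR, child) pairs and MBRs are per-axis (lo, hi) intervals):

```python
# ChooseSubtree criterion: pick the entry whose MBR needs the least area
# enlargement to include the new rectangle; break ties by smaller area.

def area(mbr):
    a = 1
    for lo, hi in mbr:
        a *= hi - lo
    return a

def enlarged(mbr, rect):
    return tuple((min(lo1, lo2), max(hi1, hi2))
                 for (lo1, hi1), (lo2, hi2) in zip(mbr, rect))

def choose_subtree(entries, rect):
    """entries: list of (mbr, child); returns the index of the chosen child."""
    def cost(i):
        mbr, _ = entries[i]
        return (area(enlarged(mbr, rect)) - area(mbr),  # enlargement first
                area(mbr))                               # then smaller area
    return min(range(len(entries)), key=cost)

entries = [(((0, 4), (0, 4)), "A"), (((5, 9), (5, 9)), "B")]
print(choose_subtree(entries, ((1, 2), (1, 2))))  # 0: A needs no enlargement
```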
Then, covering rectangles on the path from the leaf to the root need to be adjusted,
and node splits propagated as necessary. If node split propagation causes the root
to split, a new root is created whose children are the two resulting nodes.
If a deletion of an entry causes a leaf to underflow, the leaf is eliminated and the
remaining entries are reinserted. Node elimination must be propagated upwards.
All covering rectangles on the path to the root need to be adjusted, making them
smaller if possible. In the following we will have a look at four R-tree variants; for
a comprehensive survey, refer to [115] [117].
11.2.1 The R+-tree
The R+-tree was proposed as a structure that avoids visiting multiple paths during
point queries [118]. To achieve this, R+-trees do not allow overlapping of MBRs
at the same tree level. Therefore, inserted objects may have to be divided into two
or more MBRs and stored in several nodes, which results in an increase in space
consumption. Furthermore, when a node with (M + 1) rectangles, where each rectangle
encloses a smaller one, has to be split, the split procedure will fail.
11.2.2 The R*-tree
The new concepts incorporated in the R*-tree [116] are based on the minimization of the overlapping between MBRs at the same level, the minimization of the
perimeter of the produced MBRs, as well as the maximization of storage utilization. The R*-tree follows a sophisticated node split technique and uses the concept
of forced reinsertion.
Dynamic updates of the structure may have introduced MBRs which no longer guarantee good retrieval performance. Therefore,
the R*-tree forces entries to be reinserted during the insertion routine. If a node
overflows, it is not split right away. Rather, p entries are removed from the node
and reinserted into the tree. Hence, each first overflow treatment on each level will
be a reinsertion of p entries. If it is not the first overflow on that level, the split
procedure is invoked.
Experiments have shown that p = 30% yields the best performance [116].
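The overflow treatment might be sketched as follows (our assumption on the selection criterion, following the R*-tree idea: the p = 30% of entries whose MBR centers lie farthest from the center of the overflowing node's MBR are removed and reinserted before a split is considered; all names are our own):

```python
# Forced reinsertion on node overflow, for entries given as MBRs
# (tuples of per-axis (lo, hi) intervals).

def center(mbr):
    return tuple((lo + hi) / 2 for lo, hi in mbr)

def node_mbr(entries):
    d = len(entries[0])
    return tuple((min(e[i][0] for e in entries),
                  max(e[i][1] for e in entries)) for i in range(d))

def split_for_reinsert(entries, p=0.3):
    """Partition an overflowing node's entry MBRs into (kept, reinserted)."""
    c = center(node_mbr(entries))
    def dist2(mbr):
        return sum((a - b) ** 2 for a, b in zip(center(mbr), c))
    ordered = sorted(entries, key=dist2, reverse=True)
    k = max(1, int(p * len(entries)))
    return ordered[k:], ordered[:k]

entries = [((0, 1), (0, 1)), ((6, 8), (6, 8)), ((3, 4), (3, 4))]
kept, reinserted = split_for_reinsert(entries)
print(reinserted)  # [((0, 1), (0, 1))]: farthest from the node's center
```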
In summary, the R*-tree differs from the R-tree mainly in the insertion algorithm;
deletion and searching remain essentially unchanged.
11.2.3 Compact R-trees
Huang, Lin and Lin proposed compact R-trees, a dynamic R-tree version which
can achieve almost 100% storage utilization [119]. Among the (M +1) entries of an
overflowing node during insertions, a set of M entries is selected to remain in this
node, such that the resulting MBR is the minimum possible. Then, the remaining
entry is inserted into a sibling that (i) has available space, and (ii) whose MBR is
enlarged least. Thus the frequency of node splitting is reduced significantly. The
range query performance is similar to that of the original R-tree.
11.2.4 cR-trees
Brakatsoulas, Pfoser and Theodoridis have relaxed the assumption that an overflowing node has to be split into exactly two nodes [120]. In particular, they rely on
the k-means clustering algorithm and allow an overflowing node to be split in up
to k nodes (k ≥ 2). Their benchmarks showed that the resulting index quality, the
retrieval performance and the insertion time are significantly better than those of
R-trees (assuming quadratic split) and similar to those of R*-trees.
11.2.5 Static versions of R-trees
There are common applications that use static data. For instance, insertions and
deletions in census, cartographic and environmental databases are rare. Here, the
data is known in advance, and this fact is utilized in order to build a structure
that supports queries as efficiently as possible. This method is well known in the
literature as “packing” or “bulk loading”.
The Packed R-tree [121] proposed by Roussopoulos and Leifker in 1985 was the
first packing algorithm, soon after the proposal of the original R-tree. This first
effort basically suggests ordering the objects according to some spatial criterion,
e.g., according to ascending x-coordinates.
Arge et al. [122] propose the Priority R-tree, or PR-tree, which is the first R-tree
variant that always answers a window query using O((n/B)^{1−1/d} + T/B) I/Os,
where n is the number of d-dimensional (hyper-) rectangles stored in the R-tree,
B is the disk block size, and T is the output size. This is provably asymptotically
optimal and significantly better than other R-tree variants, where a query may
visit all n/B leaves in the tree even when T = 0.
Chapter 12
Packet Classification using
R-trees
Classifying an arriving packet is equivalent to finding the highest priority rectangle among all rectangles that contain the point representing the packet. All
filters have a priority attached and are located in the R-tree leaf nodes. Based
on the value of the packet header, the algorithm follows the appropriate pointers
to locate the target MBR(s), i.e., leaf node(s) in the decision tree, as described
in Algorithm 6. Note that we do not need the refinement step, where the object
itself (not its MBR) is examined for containment, since in our case, the objects are
rectangles and hence identical to their MBRs. During the traversal we keep track
of the highest priority filter so far. After all potentially matching filters have been
visited, the highest priority filter is reported.
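The traversal just described can be sketched in Python (the structures are our own assumption: leaf entries carry (MBR, priority, filter id), and a smaller number means higher priority):

```python
# R-tree packet classification: visit every matching leaf entry while
# keeping track of the best (highest) priority seen so far.

def contains(mbr, pkt):
    return all(lo <= x <= hi for x, (lo, hi) in zip(pkt, mbr))

def classify(node, pkt, best=None):
    kind, entries = node
    for entry in entries:
        if kind == "inner":
            mbr, child = entry
            if contains(mbr, pkt):
                best = classify(child, pkt, best)
        else:
            mbr, prio, fid = entry
            if contains(mbr, pkt) and (best is None or prio < best[0]):
                best = (prio, fid)
    return best

# Example: both filters match the packet; the higher priority one wins.
leaf = ("leaf", [(((0, 9), (0, 9)), 2, "f2"), (((5, 9), (0, 9)), 1, "f7")])
root = ("inner", [(((0, 9), (0, 9)), leaf)])
print(classify(root, (6, 3)))  # (1, 'f7')
```

Reporting all matching filters instead, as required by the security applications mentioned below, amounts to collecting every matching leaf entry rather than keeping only the best one.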
This search scheme has some parallels with the classification scheme of [104] [105]
[41], cf. section 10.3. In these methods, a classification is performed by traversing
a sophisticated data structure which yields not just one matching filter, but a short
list over which a linear search is performed.
A requirement of recent network security applications, e.g., network intrusion detection systems, transparent monitoring and usage-based accounting, is that all
matching filters are reported, not just the highest priority filter. R-trees support
this at no additional cost.
12.1 Performance evaluation
We measure the performance of a classification operation by the number of bytes
inspected; see below for more details. Even though only a single path from the
root to a leaf is traversed during a search, the problems inherent in R+-trees,
cf. subsection 11.2.1, disqualify this structure from further investigation. The R*-tree is widely accepted in the literature as a high-performance structure among
R-trees and its variants and is often used for performance comparisons [115]. In
our classification performance evaluation we will use the R*-tree and benchmark it
with representative packet classification algorithms. The benchmark is performed
on a Pentium 4 dual core 2.8 GHz machine.
In our simulations we will use a main memory-based implementation of R*-trees.
12.1.1 Filter sets
Different filter sets with different structures and sizes tend to give very different
results. The performance of the algorithm on “real” filter sets is the decisive factor in any realistic evaluation. For security and confidentiality reasons, real
filter sets are hardly available. In response to this problem, Taylor and Turner
developed ClassBench, a suite of tools for benchmarking packet classification algorithms [123] [98]. The ClassBench tools suite includes a Filter Set Generator which
produces synthetic filter sets that accurately model the characteristics of real filter
sets. The tools suite also includes a Trace Generator that produces a sequence of
packet headers to exercise the synthetic filter set. To implement these tools, Taylor and Turner analysed 12 real filter sets provided by Internet Service Providers
(ISPs), a network equipment vendor, and other researchers working in the field.
The filter sets range in size from 68 to 4557 entries and utilize one of the following
formats [124]:
• Access Control List (ACL) - standard format for security, VPN, and NAT
filters for firewalls and routers (enterprise, edge, and backbone)
• IP Chain (IPC) - decision tree format for security, VPN, and NAT filters for
software-based systems
• Firewall (FW) - proprietary format for specifying security filters for firewalls
Their analysis provides invaluable insight into the structure of real filter sets.
A repository [125] has been established to provide synthetic filter sets and trace files
generated with ClassBench as well as source codes of representative classification
algorithms. These synthetic sets are generated with the ClassBench tools suite
using seed filter sets that are extracted from three real filter sets, utilizing the
above mentioned three different formats. These real filter sets have the following
characteristics [124]:
• acl1 : In this filter set, fully specified source and destination addresses dominate the distribution. The destination port specification can be either an
exact value (in most cases), a wildcard or an arbitrary range. All source
ports are specified by a wildcard.
• fw1 : The most common prefix pair is a fully specified destination address
and a wildcard for the source address. The ports are specified either by a
wildcard, an exact value, an arbitrary range or a HI range ([1023 : 65535]).
• ipc1 : Fully specified source and destination addresses dominate the distribution, yet not as much as in acl1. Port specifications are as in the fw1 set,
yet with different distributions.
The protocol is specified by a unique value or the wildcard. See [124] for detailed
characteristics.
The repository provides each type of synthetic filter set in sizes of 100, 1K, 5K
and 10K filters, together with a corresponding trace file for each filter set. The size
of a trace is about ten times that of the corresponding filter set.
12.1.2 Simulation results of the R*-tree
In this simulation we use the filter and trace files provided by [125]. All filters are
five-dimensional with 32-bit source and destination IP addresses, 16-bit source and
destination port numbers and an eight-bit protocol. For each filter set and corresponding trace file, we evaluate the R*-tree’s performance. In each simulation,
we iteratively insert the respective filters into an initially empty R*-tree. After
all filters have been inserted, we use this tree for packet classification. Therefore,
once the R*-tree has been built, we use it in a static fashion. If the R*-tree proves
suitable in a static classification scenario, it can then be investigated further
in a dynamic scenario, i.e., one where classification is intermixed with filter updates.
Hence, the following benchmark can be considered a stepping stone towards benchmarking R*-trees in a dynamic classification environment. Note that both algorithms against which the R*-tree is benchmarked are optimized for static scenarios.
In our simulations we measure the total memory requirement as well as the worst-case
and average number of bytes inspected per classification. To measure the
total memory requirement, we sum the memory consumption of the contents
of all nodes. A node mainly maintains its level information, its capacity M, the
number of its children, its MBR and identifier, as well as its children's MBRs and
IDs.
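Under assumed field sizes, this accounting can be sketched as follows; the 4-byte counters and identifiers and the 4-byte MBR bounds in five dimensions are illustrative choices, not the exact layout of the implementation used in the simulations.

```python
# Sketch of the per-node memory accounting described above. Field sizes are
# assumptions for illustration: 4 bytes each for the level, the capacity M,
# the child counter and every identifier, and 4 bytes per MBR bound in each
# of the five dimensions.

def node_bytes(num_children, dims=5, bound_bytes=4):
    mbr = 2 * dims * bound_bytes              # low and high bound per dimension
    header = 4 + 4 + 4 + 4 + mbr              # level, M, #children, id, own MBR
    return header + num_children * (4 + mbr)  # plus each child's id and MBR

def total_memory(children_counts):
    """Sum the node sizes over all nodes of the tree."""
    return sum(node_bytes(c) for c in children_counts)
```

With these assumptions, an empty node occupies 56 bytes and a node with two children 144 bytes.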
To measure the number of bytes per classification, we sum the number of
bytes that are read at each node visit. For efficient packet classification, that
number must be kept to a minimum. At each node we have to examine each entry's MBR to determine which path(s) to descend. When examining an entry's
MBR, we check each of the MBR's dimensions in turn, testing whether it contains the
packet's coordinate in the respective dimension.
This process is aborted as soon as the packet falls out of range in one of the five
dimensions.
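This early-abort test can be sketched as follows; representing an MBR as a list of (low, high) bound pairs and charging 4 bytes per inspected bound are assumptions made for illustration.

```python
# Per-dimension containment test with early abort, as described above.
# Assumptions for illustration: an MBR is a list of (low, high) pairs,
# a packet is a tuple of coordinates, and each bound read costs 4 bytes.

def mbr_contains(mbr, packet, bytes_per_bound=4):
    """Return (contained, bytes_read); stops at the first failing dimension."""
    bytes_read = 0
    for (low, high), coord in zip(mbr, packet):
        bytes_read += 2 * bytes_per_bound    # both bounds of this dimension read
        if coord < low or coord > high:
            return False, bytes_read         # abort: later dimensions untouched
    return True, bytes_read
```

If the packet already falls out of range in the first dimension, only that dimension's bounds are counted.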
Our implementations are based on the R/R*-tree implementations by Hadjieleftheriou [126]. Figure 12.1 shows the R*-tree's simulation results
for the ACL, FW and IPC classification types. The results show that the R*-tree
is very space efficient: even in the case of 10K filters, total memory consumption
remains below 600 KB. In terms of classification performance, the results show
that the R*-tree scales best (in the number of filters) for the FW type.

Figure 12.1: Performance Evaluation R* (total memory in KB as well as worst-case and average-case bytes per classification, plotted against the number of filters for the ACL1, IPC1 and FW1 sets).
We further tested the R-tree, utilizing quadratic split, and benchmarked it against the
R*-tree. The R*-tree consistently demonstrated better performance for all three
classification types. As already discussed, the R-tree is based solely on the area
minimization of each MBR. The R*-tree goes beyond this criterion and additionally minimizes the overlap between MBRs at the
same level as well as the perimeter of the produced MBRs,
which improves query processing performance. Hence, in the following, we only
consider the R*-tree.
In order to investigate how well the R*-tree is suited for packet classification, we benchmark it against HyperCuts [105] and RFC [99]. RFC appears to be the fastest classification algorithm for static filter sets in the current literature. HiCuts [104]
and its improved version HyperCuts are seminal techniques providing excellent
tradeoffs. According to [127], HyperCuts is one of the most promising algorithmic
solutions. For this benchmark we used the source codes of HyperCuts and RFC
provided by [125]. Along with the codes, [125] further provides an evaluation of
these packet classification algorithms, measuring the number of bytes consumed per
filter as well as the worst-case and average number of bytes read per classification.
HyperCuts involves some tradeoffs, heuristics, and optimizations. These tunable
parameters have tremendous effects on HyperCuts' performance. It is important
to isolate them and evaluate their behavior carefully in order to clarify their impact on the algorithm. [125] provides an evaluation only for a selected set of filter
sets, namely the FW filter set of size 100, the IPC filter set of size 1K and the ACL set
of size 10K. Yet, to get fair benchmark results, it is necessary to determine HyperCuts' optimal parameters for all of the given filter sets. Therefore, we conduct
a thorough performance evaluation of HyperCuts, which is presented in the
following subsection.
12.1.3 Benchmark of R*-tree and HyperCuts
As we will see, the performance of HyperCuts is highly sensitive to the configurable
parameters, which are the space factor, the bucket size and the filter push level.
The space factor is used to bound the number of cuts on each chosen dimension.
The bucket size determines the maximum number of filters allowed in a leaf node.
It is used to determine when to terminate the decision tree construction. A larger
bucket size can help to reduce the size and the depth of the decision tree, but
induces a longer linear search time; a smaller bucket size has the opposite effect.
HyperCuts pulls all common filters in a subtree into a linear list at the root of the
subtree. The filter push level restricts the number of tree levels across which common filters may be pulled upwards.
After a sequence of cuts has been performed, the portion of a hypercube in a subregion might be fully covered by a hypercube with a higher priority. The corresponding filter at this decision tree node is redundant and can thus be removed to
save storage. This is referred to as the filter overlap optimization. Our simulations have shown that the filter overlap optimization can reduce the storage to
some extent but has no significant effect on the classification throughput. Hence,
in our simulations, we enable the filter overlap optimization.
In the following simulations, we measure the number of bytes consumed per filter
as well as the worst-case number of bytes per classification for the various
filter sets.
Filter set size 100
Figure 12.2 shows the HyperCuts performance evaluation results in terms of its
sensitivity to (i) the space factor, (ii) the bucket size as well as (iii) the filter push
level. The storage decreases when the bucket size increases. Generally, a larger
bucket size means worse lookup throughput, though this is not always the case. When
the filter pushing optimization is disabled, i.e., the filter push level is set to zero,
the storage use is very inefficient for the FW filter set, yet the throughput is the
best. When the filter push level is increased, the storage efficiency is significantly
improved; however, the throughput becomes much worse.
For the IPC set, increasing the filter push level worsens classification performance,
yet without affecting storage. In the case of ACL, the algorithm is insensitive to the
number of filter push levels.

Figure 12.2: HyperCuts performance evaluation for filter set size 100 (bytes per filter and worst-case bytes per classification as functions of the space factor, the bucket size and the filter push level for the ACL_100, IPC_100 and FW_100 sets).
For the FW filter set, 46 bytes per filter and 944 bytes per classification are a good
tradeoff. The average number of bytes read per classification in this setting is 605.
In comparison, R* needs 54 bytes per filter, 1400 bytes per classification in the
worst case and 889 in the average case.
For the IPC filter set, 37 / 220 (bytes consumed per filter / bytes inspected in the
worst case) give a good tradeoff. R* consumes 72 bytes per filter and inspects 905
bytes per classification in the worst case.
In simulation (i), the bucket size was set to 10, the push level was set to 1 in case
of FW, and zero for the ACL and IPC filter sets. In simulation (ii), the space
factor was set to four, the push level was set to 1 in case of FW, and zero for the
ACL and IPC filter sets. The bucket size was set to ten in case of FW and IPC,
and 16 in case of ACL in simulation (iii). Further, the space factor was set to two
(FW) and four (IPC, ACL) respectively.
Filter set size 1K
Figure 12.3 shows the HyperCuts performance evaluation results for filter set sizes
of 1K. For the FW filter set, 350 / 1912 give a good tradeoff. For best memory
efficiency, i.e., 52 bytes per filter, classification takes 7796 bytes in the worst case
and 4077 in the average case. In comparison, R* needs 58 bytes per filter and 4260
bytes per classification in the worst and 2376 in the average case.
Remark: for the FW type, the algorithm does not work for all parameter settings,
cf. Figure 12.3, simulations (ii) and (iii).
For the IPC set, an increasing space factor yields better performance at the cost of
higher memory consumption. The sensitivity to the bucket size is the reverse: the storage
decreases monotonically as the bucket size increases.
The algorithm is highly sensitive to the filter push level. When the filter push
level is increased, the classification cost increases drastically for the IPC
set. For the IPC set, 61 / 660 give a good tradeoff. In comparison, R* needs 64 bytes per filter and
4453 bytes per classification in the worst case.
In simulation (i), the bucket size was set to 24 (FW) and 16 (IPC, ACL) respectively. The push level was set to 1 in case of FW, and zero for the ACL and IPC
filter sets. In simulation (ii) and (iii), the space factor was set to one (FW), two
(IPC) and four (ACL). The push level was set to 1 in case of FW, and zero for the
ACL and IPC filter sets. In simulation (iii) the bucket size was set to 24 in case
of FW and IPC, and 16 in case of ACL.
Filter set size 5K
Figure 12.4 shows the HyperCuts performance evaluation results for filter set sizes
of 5K. For the FW filter set, HyperCuts needs 7 MB in total for a worst-case
performance of 3038 bytes per classification. In the case of good memory efficiency (51 bytes per filter), the
algorithm needs twice as many bytes per classification in the worst case and 2563
on average. In comparison, R* needs 55 bytes per filter and inspects approximately
12000 bytes per classification in the worst case and 5080 bytes in the average case.
For the IPC filter set, a classification performance of 668 bytes requires more than 10 MB
in total. For the IPC set, 187 / 1088 give a good tradeoff. R* only consumes 65
bytes per filter, but shows much worse classification performance.
For the ACL set, 59 / 772 give a good tradeoff. R* is on par in terms of memory
consumption, but has poor classification performance.
Figure 12.3: HyperCuts performance evaluation for filter set size 1K (bytes per filter and worst-case bytes per classification as functions of the space factor, the bucket size and the filter push level for the ACL_1K, IPC_1K and FW_1K sets).
If, in simulation (iii), the space factor is set to four, the algorithm needs 65.3
bytes per filter and 14876 bytes per classification in the worst case for the IPC set.
As can be seen, the parameters have to be chosen very carefully to achieve good
performance results.
In simulation (i), the bucket size was set to 16 (FW) and 24 (IPC, ACL) respectively. The push level was set to 1 in case of FW, and zero for the ACL and IPC
filter sets. In simulation (ii) and (iii), the space factor was set to two (FW, ACL)
and one (IPC). The push level was set to 1 in case of FW, and zero for the ACL
and IPC filter sets. In simulation (iii) the bucket size was set to 24 in case of FW
and ACL, and 32 in case of IPC.

Figure 12.4: HyperCuts performance evaluation for filter set size 5K (bytes per filter and worst-case bytes per classification as functions of the space factor, the bucket size and the filter push level for the ACL_5K, IPC_5K and FW_5K sets).
Filter set size 10K
Figure 12.5 shows the HyperCuts performance evaluation results for filter set sizes
of 10K. For the FW filter set, a space factor of one or two means large storage
and good performance, while a space factor of eight means low storage and worse performance. By increasing the bucket size, we can reduce storage while maintaining the
performance. But still 788 bytes per filter are needed for a worst-case classification
performance of 3426 bytes.

Figure 12.5: HyperCuts performance evaluation for filter set size 10K (bytes per filter and worst-case bytes per classification as functions of the space factor, the bucket size and the filter push level for the ACL_10K, IPC_10K and FW_10K sets).

For high memory efficiency (48 bytes per filter), the algorithm inspects 10594 bytes per classification in the worst case and 5298 on average. In comparison, R* needs 54 bytes per filter, 16786 bytes per classification in the worst case and 6398 on average. For the FW set, the algorithm proves relatively insensitive to the number of push levels.
As can be seen, for the IPC filter sets the algorithm is highly sensitive to the space
factor. With a badly chosen space factor, the classification performance degrades by
a factor of 15.
Setting the push level to zero, a space factor of four results in high classification
performance (640 bytes per classification), yet with high storage consumption. Increasing the bucket size greatly reduces storage with only a slight deterioration in
performance, but still above 1K bytes per filter are necessary. When the filter
pushing optimization is disabled, i.e., the filter push level is set to zero, the storage
use is very inefficient for the IPC filter sets, yet the throughput is the best. When the
filter push level is increased, the storage efficiency is significantly improved; however, the throughput becomes much worse. Choosing a space factor of two, a push
level of one and a bucket size of 16 seems to give the best tradeoff (214 bytes per filter,
1460 bytes per classification). With these parameters, HyperCuts needs almost
four times more memory than R*, but classifies about 17 times faster.
With high memory efficiency, HyperCuts needs 22876 bytes per classification and
consumes 34 bytes per filter. In comparison, R* needs 56 bytes per filter and reads
24478 bytes per classification operation.
Using the ACL filter set, HyperCuts greatly outperforms the R*-tree with respect to
worst-case bytes per classification. In terms of storage, the algorithms are on par.
In simulation (i), the bucket size was set to 24 (FW, ACL) and 16 (IPC) respectively. The push level was set to 1 (FW, IPC) and zero for the ACL filter set.
In simulation (ii) and (iii), the space factor was set to two (FW, ACL) and four
(IPC). The push level was set to 1 in case of FW, and zero for the ACL and IPC
filter sets. In simulation (iii) the bucket size was set to 40 (FW), 24 (IPC) and 16
(ACL).
Summary of results
The R*-tree has been shown to scale well to large filter sets in terms of memory consumption for all three classification types.
When HyperCuts is tuned for high memory efficiency, R* and HyperCuts are on par
in terms of memory consumption for the FW filter sets. In the case of 1K, R* even
outperforms HyperCuts in terms of classification performance by a factor of approximately 2. For the remaining FW filter set sizes, HyperCuts is only about a
factor of 1.5 to 2 better than R*, cf. Figure 12.6.
When HyperCuts is tuned for high memory efficiency, both algorithms are on par in
terms of memory consumption as well as classification performance for the IPC
10K set. Choosing good tradeoffs, HyperCuts needs up to four times more storage,
but has better classification performance. HyperCuts greatly outperforms the R*-tree on the ACL filter set in terms of worst-case bytes per classification. In terms
of storage, the algorithms are on par.
12.1.4 Benchmark of R*-tree and RFC
Figure 12.7 shows the RFC performance evaluation results in terms of bytes per
filter as well as the number of bytes per classification for the ACL, FW and IPC
classification types. These results can also be found in the evaluation of [125].

Figure 12.6: Benchmark results for the FW filter sets. HyperCuts' parameters tuned for (i) above: high classification performance, (ii) below: high memory efficiency. (Each panel plots bytes per filter and worst-case bytes per classification for R* and HyperCuts against the number of filters.)

The RFC implementation is provided in a 3- and 4-phase configuration. According
to reported experimental results, there is a slight improvement in lookup
throughput with a decreasing number of phases, but the storage can
become worse by orders of magnitude. In our simulation, we use the 4-phase
configuration. The lookup throughput gets slightly worse when the filter sets
become larger, cf. Figure 12.7. The memory consumption does not necessarily get
worse when the filter sets become larger, as in the case of ACL filter sets. Yet,
RFC shows severe scalability problems for the FW and IPC filter sets in terms of
storage consumption. RFC uses approximately 4.8 MB for FW filter set sizes of
100 and 11 MB for sizes of 1K. For the IPC 5K set, RFC consumes 175 MB in
total.
Figure 12.7: RFC performance evaluation (bytes per filter, on a logarithmic scale, and bytes per classification plotted against the number of filters for the ACL, IPC and FW sets).
12.2 Conclusions and future directions
According to our benchmark results, R* is competitive with HyperCuts for static
packet classification for FW filter sets.
Network state changes (e.g., link failures) along the policy-based routes or dynamic
topologies (ad hoc networks) are scenarios where policies need to get updated.
Most existing packet classification solutions do not support (fast) incremental updates. RFC’s extensive precomputation precludes dynamic updates at high rates.
HyperCuts’s support for incremental updates is not specifically addressed. While
it is conceivable that the data structure can support a moderate rate of randomized updates, it appears that an adversarial stream of updates can either create
an arbitrarily deep decision tree or force a significant restructuring of the tree [38].
The preprocessing time can be taken as an indication for the time needed for the
reconstruction of the structure. Precomputation can be defined as the process of
transforming the representation of a filter database (i.e., the way in which the filters are expressed and stored) to represent that same data in a way more suitable
to the classification procedure. Taking the FW 10K filter set for example, our
simulations measured a preprocessing time of 7.4 seconds on a 3 GHz Pentium-IV.
When an insertion or deletion of a filter is triggered, a delay of several seconds
is unacceptable. A strength of the R*-tree is its support of incremental updates.
Investigating the performance of R*-trees in a dynamic classification environment,
i.e., where classification is intermixed with filter updates, would be an interesting future research field. To our knowledge, there is no prior work that presents
simulation results of a dynamic packet classification scenario. Gupta and McKeown [104] present experimental results of the average update time over 10000
random incremental updates, but not how these affect classification performance.
In static environments, the fact that the data is known in advance is used in order
to build a structure that supports queries as efficiently as possible. This method is
well known in the literature as “packing” or “bulk loading”. Packed R-trees, e.g.,
the Priority R-tree as proposed by Arge et al. [122], might qualify as a structure
to be applied in static packet classification environments.
We have seen that during classification, each entry of a node has to be checked
to determine whether it contains the packet to be classified. Further, more than one subtree under
a node may have to be visited to locate the highest priority filter matching a
packet. To speed up the search, all entries in a node could be queried in parallel.
Furthermore, several branches of the tree can be searched in parallel. The total
number of bytes that are inspected remains the same, but the classification speed
is thus increased. Therefore, the parallelism offered by hardware can be leveraged.
A current generation Xilinx FPGA operates at over 550 MHz and contains over
10Mb (1.25 MB) of embedded memory. The R*-tree has proven to be very space
efficient: even in the case of 10k filters, it requires less than 0.6 MB to store the
filters. Additional research is required to evaluate the R*-tree's performance when
implemented directly in hardware.
Part III
Conflict Detection and Resolution
Chapter 13
Introduction
Policy-based routing requires network routers to examine multiple fields of the
packet header in order to categorize packets into “flows”. Flow identification entails
searching a table of predefined filters to identify the appropriate flow based on
criteria including IP address, port, and protocol type. If the header of an arriving
packet matches more than one filter, a tiebreaker determines the filter which is
to be applied. A common tiebreaker is to select the highest priority filter among
all matching filters. Hari et al. noticed that not every policy can be enforced by
assigning priorities [7]. The authors suggest to employ the most specific tiebreaker
(MSTB). In one-dimensional prefix tables, the most specific filter is equivalent to
the longest matching prefix. The most specific criterion is a special case of the
highest priority criterion: a filter f1 is assigned a higher priority than a filter f2 if f1
is more specific than f2 . A filter f1 is more specific than a filter f2 iff f1 ⊂ f2 . The
most specific tiebreaker is only feasible if for each packet p the most specific filter
that applies to p is well defined. Otherwise, the filter set is said to be conflicting.
The conflict detection problem occurs in two variants, the offline and the online mode.
Algorithms for the offline version are given a set of filters and report (and resolve)
all conflicts. The online version, on the other hand, gradually builds up a
conflict-free set R: on every insertion or deletion of a range r, the current status of R
is checked and, if conflicts occur, the algorithm offers solutions
to resolve them and to maintain the conflict-free property of R.
In this part we propose a conflict detection and resolution algorithm for static
one-dimensional range tables, i.e., where each filter is specified by an arbitrary
range. We are motivated to study the one-dimensional case for the following
reason. Multi-dimensional classifiers typically have one or more fields that are
arbitrary ranges. Since a solution for multi-dimensional conflict detection often
builds on data structures for the one-dimensional case, it is beneficial to develop
efficient solutions for one-dimensional range router tables. This line of research
has been conducted in collaboration with Khaireel Mohamed, Thomas Ottmann
and Amitava Datta. The “Slab-Detect” and naïve algorithms presented in this
part were implemented by my colleague Khaireel Mohamed.
13.1 Organization of this part
In the following two sections we introduce the terminology we use and describe related work [128]. After presenting our conflict detection and resolution algorithm
for one-dimensional range tables in section 14.1, we provide experimental results from benchmarking the new solution against a naïve algorithm. Section 14.3
describes how the algorithm can be adapted to work under the highest-priority
tiebreaking rule. Motivated by our main application, we consider a related problem in section 14.4. We show that by making use of partial persistence, the data
structure can also support IP lookup.
13.2 Preliminaries
A one-dimensional filter applies to a packet if the range that represents the filter
contains the point representing the packet. A point p is said to stab a range [u, v]
if p ∈ [u, v]. A stabbing query reports all ranges that are stabbed by a given query
point.
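As a point of reference, a stabbing query can be answered by a simple linear scan; the (u, v) tuple representation of ranges is an assumption for illustration, and the data structures discussed later answer such queries far more efficiently.

```python
# Naive stabbing query: report all ranges [u, v] stabbed by the point p,
# i.e., all ranges with u <= p <= v. Ranges are modelled as (u, v) tuples.

def stabbing_query(ranges, p):
    return [(u, v) for (u, v) in ranges if u <= p <= v]

R = [(0, 10), (3, 7), (8, 12)]
```

Here the point 5 stabs [0, 10] and [3, 7] but not [8, 12].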
Let R be a set of one-dimensional arbitrary ranges. The MSTB rule is only feasible
if the set R is conflict-free.
Definition 2. The set R is conflict-free iff for each point p there is a unique
range r ∈ R such that p ∈ r and for all other ranges s ∈ R stabbed by p, i.e.,
p ∈ s, r ⊆ s.
If the set R of ranges is conflict-free then for each point p there is a well defined
most specific filter in R which contains p.
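On small instances, Definition 2 can be verified pointwise; the integer endpoints, the (u, v) tuple representation and the exhaustive scan in the sketch below are assumptions for illustration only.

```python
# Pointwise check of Definition 2 (illustrative brute force over integer
# coordinates): at every point p, the shortest stabbed range must be
# contained in all other stabbed ranges, otherwise there is no unique
# most specific range at p.

def conflict_free(R):
    lo, hi = min(u for u, v in R), max(v for u, v in R)
    for p in range(lo, hi + 1):
        stabbed = [(u, v) for (u, v) in R if u <= p <= v]
        if not stabbed:
            continue
        best = min(stabbed, key=lambda r: r[1] - r[0])
        if not all(u <= best[0] and best[1] <= v for (u, v) in stabbed):
            return False            # no unique most specific range at p
    return True
```

For example, the partially overlapping pair {(0, 6), (4, 10)} is conflicting, while adding their intersection (4, 6) makes the set conflict-free.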
Definition 3. Two filters r and s partially overlap if r ∩ s ≠ ∅ and r ∩ s ≠ r
and r ∩ s ≠ s.
Definition 4. A set R of filters is nested if for any pair r, s ∈ R, either r ⊂ s
or s ⊂ r.
Definition 5. A set R of ranges is called nonintersecting if for any two ranges
r, s ∈ R either r ∩ s = ∅ or r ⊂ s or s ⊂ r.
In other words, R is nonintersecting if any two ranges are either disjoint or one
is completely contained in the other. It is obvious that a set of nonintersecting
ranges is always conflict-free, i.e., for each packet (query point p) the most specific
range in R containing p is well defined. There may be, however, conflict-free sets
of ranges which are not nonintersecting. Consider, e.g., a set R of three ranges
{r, s, t}, where r and s partially overlap and t = r ∩ s.
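Definitions 3–5 translate directly into predicates; the sketch below assumes ranges with integer endpoints given as (u, v) tuples and is meant only to illustrate the definitions.

```python
# Predicates for Definitions 3-5 on (u, v) ranges, viewing each range as
# the set of points it contains.

def subset(r, s):               # r ⊆ s
    return s[0] <= r[0] and r[1] <= s[1]

def disjoint(r, s):             # r ∩ s = ∅
    return r[1] < s[0] or s[1] < r[0]

def partially_overlap(r, s):    # Definition 3
    return not disjoint(r, s) and not subset(r, s) and not subset(s, r)

def nonintersecting(R):         # Definition 5: pairwise disjoint or nested
    return all(disjoint(r, s) or subset(r, s) or subset(s, r)
               for i, r in enumerate(R) for s in R[i + 1:])

# The example from the text: r and s partially overlap and t = r ∩ s; the
# set {r, s, t} is conflict-free but not nonintersecting.
r, s, t = (0, 6), (4, 10), (4, 6)
```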
Definition 6. Two ranges r, s ∈ R are in conflict with respect to R if r and s
partially overlap and there is a point p such that p ∈ r ∩ s but there is no range
t ∈ R such that p ∈ t and t ⊂ r and t ⊂ s.
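Definition 6 admits a direct brute-force test on small inputs; the sketch below assumes integer endpoints and (u, v) tuples and simply checks every point of the overlap, which the algorithms developed in the following chapter of course avoid.

```python
# Brute-force check of Definition 6, assuming integer endpoints.

def proper_subset(t, r):        # t ⊂ r
    return r[0] <= t[0] and t[1] <= r[1] and t != r

def in_conflict(r, s, R):
    """r and s are in conflict w.r.t. R if they partially overlap and some
    point of r ∩ s lies in no range t ∈ R with t ⊂ r and t ⊂ s."""
    lo, hi = max(r[0], s[0]), min(r[1], s[1])
    if lo > hi or proper_subset(r, s) or proper_subset(s, r) or r == s:
        return False            # disjoint or nested: no partial overlap
    return any(not any(t[0] <= p <= t[1] and proper_subset(t, r)
                       and proper_subset(t, s) for t in R)
               for p in range(lo, hi + 1))
```

For R = {(0, 6), (4, 10)} the two ranges are in conflict; adding the range (4, 6) to R removes the conflict.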
Definition 7. Let R = {r1, . . . , rn} be a set of n ranges. Then ∏(R) = r1 ∪ r2 ∪ · · · ∪ rn (see [21]).
Lemma 1. If two ranges r, s ∈ R are in conflict then there is no subset S ⊆ R
such that ∏(S) = r ∩ s.
Proof. Let r, s be two conflicting ranges. Consider the overlapping range r ∩ s and
assume that there is a subset S ⊆ R such that ∏(S) = r ∩ s. Take t ∈ S and
p ∈ t arbitrarily. Then r, s and t contain p and t ⊂ r and t ⊂ s, contradicting the
assumption that r and s are in conflict.
The reverse of the above lemma is also true. We show:
Lemma 2. If r, s ∈ R are conflict-free then either r ∩ s = ∅ or r ⊆ s or s ⊆ r or
there is a subset S ⊆ R such that ∏(S) = r ∩ s.
Proof. It is sufficient to assume that r ∩ s ≠ ∅ and neither r ⊆ s nor s ⊆ r. Because
r and s are conflict-free, there exists for each p ∈ r ∩ s a range tp ∈ R such that
p ∈ tp and tp ⊂ r and tp ⊂ s. Choose S = {tp : p ∈ r ∩ s}. Then ∏(S) = r ∩ s.
With the MSTB rule in mind, we know for each point p ∈ r ∩ s, where r and s are two nonconflicting but partially overlapping ranges, that neither r nor s is the filter determining the routing of p. Intuitively speaking, we consider only those pairs of filters r and s as conflicting in R for which there are points in r ∩ s at which lookup cannot be transferred to more specific ranges.
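The conflict notion of Definition 6 can be made concrete with a small brute-force check over a discrete universe. This is an illustrative sketch, not an algorithm from the text: the function names and the tuple representation of ranges are our own, and the enumeration of points is only feasible for tiny universes.

```python
# Ranges are closed integer intervals represented as (lo, hi) tuples.

def points(r):
    """All discrete points covered by range r = (lo, hi)."""
    return set(range(r[0], r[1] + 1))

def partially_overlap(r, s):
    """r and s overlap, but neither contains the other."""
    pr, ps = points(r), points(s)
    inter = pr & ps
    return bool(inter) and inter != pr and inter != ps

def in_conflict(r, s, R):
    """Definition 6: r, s conflict w.r.t. R iff they partially overlap and some
    p in r ∩ s lies in no t ∈ R with t ⊂ r and t ⊂ s (proper containment)."""
    if not partially_overlap(r, s):
        return False
    inter = points(r) & points(s)
    for p in inter:
        resolved = any(
            p in points(t)
            and points(t) < points(r)   # t is a proper subset of r
            and points(t) < points(s)   # ... and of s
            for t in R
        )
        if not resolved:
            return True
    return False

# As in the text: r and s partially overlap; t = r ∩ s resolves the conflict.
r, s, t = (0, 10), (5, 15), (5, 10)
print(in_conflict(r, s, {r, s}))      # True: no resolving range exists
print(in_conflict(r, s, {r, s, t}))   # False: t covers the overlap region
```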
Hari et al.’s [7] definition of a resolve filter for two-dimensional prefix filters naturally translates to one dimension.
Definition 8. Let r, s ∈ R be two conflicting ranges. Then we call the overlapping range r ∩ s the resolve filter for r and s with respect to R. We denote by resolve(R) the set containing a resolve filter for every pair of conflicting filters in R.
Chapter 13. Introduction
Figure 13.1: Intervals and their relative position.
Theorem 2. Let R be a set of one-dimensional ranges. Then R ∪ resolve(R) is
a conflict-free set of filters.
Proof. We prove this by induction [129].
Basis: The basis of the induction is a set of two ranges and the resolve filter
induced by them. There are three cases, (i) the two ranges are disjoint, (ii) one
of them encloses the other, and (iii) they partially overlap. There is no need to
introduce any resolve filter in the first two cases and hence the theorem holds. It
is easy to see that the resolve filter introduced in the last case does not introduce
any conflict.
Induction: Suppose we have a set of i − 1 ranges and their resolve filters already
introduced and this set is conflict-free. We have to prove that the resulting set is
still conflict-free when we introduce a new range in this set as well as the resolve
filters for this new range. We refer to Fig. 13.1 for the proof.
All the ranges existing before the introduction of the new range can be classified into five categories: (i) the ranges that left overlap the new range, (ii) the ranges that right overlap the new range, (iii) the ranges that enclose the new range, (iv) the ranges that are enclosed by the new range, and (v) the ranges that are disjoint from the new range.
Note that the ranges in category (v) or their resolve filters cannot cause any
conflict after the introduction of the new range, because none of these resolve
filters overlap the newly introduced range. Similarly, the ranges or their resolve
filters in category (iv) do not introduce any conflict as these ranges and their
resolve filters are nested within the newly introduced range.
Hence, we have to concentrate on the ranges in the first three categories. However, the proof is almost identical for the ranges in categories (i) and (ii), and we will only consider the proof for ranges in category (i), i.e., the ranges that left overlap the new range, and category (iii), i.e., the ranges that enclose the new range. According
to our definition of conflict, left- or right-overlap does not necessarily cause a
conflict. We only insert resolve filters when there is a conflict, otherwise, we only
insert the new range. For each of these categories, we have to prove, (a) the old
Figure 13.2: Any two ranges in RFi and RFi−1 are conflict-free.
resolve filters do not conflict with the resolve filters introduced for the new range,
and (b) the old resolve filters do not conflict with the new range. If the new range
overlaps with existing ranges but there is no conflict, then (a) is straightforward
to prove, since in this case there are no newly inserted resolve filters. Hence in the
following, we will only consider ranges that overlap and conflict. We denote all
the ranges in category (i) by L. Suppose we are adding the i-th range ri and its
low and high end points are li and hi respectively. Consider a range rk ∈ L that
left overlaps and conflicts with ri . The low and high end points of rk are lk and hk .
We can sort all the ranges that left overlap ri according to their right end points.
After the introduction of ri , we introduce a resolve filter for a range rk ∈ L (as
rk conflicts with ri ) and this resolve filter is a new range [li , hk ]. Clearly, all these
new resolve filters (we call this set RFi ) due to ri are not in conflict as they are
nested.
Consider now the resolve filters (we call this set RFi−1 ) that existed due to the
ranges in L before the introduction of ri . We have to check whether the resolve
filters in RFi−1 are in conflict with the resolve filters in RFi . We prove by contradiction that there is no such conflict. Assume that there is a resolve filter
rfjk ∈ RFi−1 such that rfjk has a conflict with some resolve filter in RFi . rfjk was
introduced for resolving the conflict of two ranges rj and rk before the introduction
of ri . Suppose rfjk conflicts with rfim ∈ RFi . Clearly, the left end point of rfjk
is to the left of li as both rj and rk left overlap ri . The left end point of rfim is li
as this resolve filter was introduced due to ri . Consider now rfim and the part of
rfjk to the right of li (the part of rfjk to the left of li remains conflict-free after
the introduction of ri ). See Fig. 13.2.
There are two cases as shown in Fig. 13.2, depending on the relative positions of
the end points of rj , rk and rm . First, we consider Fig. 13.2(a). In this case, the
right end point of rfim is to the right of the right end point of rfjk . Note that the
right end point of rfjk is either due to the right end point of rj or due to the right
end point of rk (as rfjk is a resolve filter for rj and rk ). Suppose, without loss of
generality, the right end point of rfjk is due to the right end point of rj . Then we
have already introduced a resolve filter that starts at li and ends at the high end
point of rj , since we have added a resolve filter to resolve the conflict between rj
and ri after we added ri . This resolve filter for ri and rj is shown by the dashed
line in Fig. 13.2(a). This resolve filter for ri and rj resolves the conflict between
Figure 13.3: A range that encloses ri cannot have any conflict due to the introduction of ri .
rfim and rfjk . Next, consider Fig. 13.2(b), i.e., where the right end point of rfim
is to the left of the right end point of rfjk . Clearly, there is no conflict, as rfim is
nested in rfjk . We can prove in a similar way that the resolve filters in RFi−1 do
not conflict with the new range ri .
Finally, we have to consider the ranges in category (iii), i.e., the ranges that enclose
ri . Suppose rp is such a range. Clearly, ri is not in conflict with rp . Hence, (a) is
straightforward to prove. However, we have to check whether any resolve filter due
to rp is in conflict with ri . Suppose there is a range rl which has a conflict with
rp and we have introduced a resolve filter earlier to resolve this conflict. Clearly,
if rl ∩ ri = ∅ then their resolve filter does not conflict with ri . There are three
possibilities left as shown in Fig. 13.3.
In the first case, the resolve filter due to rp and rl has a conflict with ri (Fig.
13.3(a)). However, in this case rl left overlaps ri and hence we have already
introduced a resolve filter (shown by the dashed line in Fig. 13.3(a)) to resolve
the conflict between rl and ri which also resolves the conflict between ri and the
resolve filter due to rp and rl . In the second case, rl right overlaps ri and the proof is
similar. In the last case rl encloses ri and hence the resolve filter for rp and rl also
encloses ri .
This concludes the proof.
Figure 13.4 shows an example for a set R of one-dimensional filters: r and s are
not conflicting, because for each point p ∈ r ∩ s there is a range t ∈ R such that
p ∈ t and t ⊂ r and t ⊂ s. However, the pairs (a,b), (b,c), (c,d) are all conflicting
pairs of filters. Hence, R is not conflict-free. Adding a ∩ b, b ∩ c, c ∩ d to R results
in a conflict-free set of filters, i.e., R ∪ resolve(R) = R ∪ {a ∩ b, b ∩ c, c ∩ d} is
conflict-free.
Figure 13.4: Filters r and s are conflict-free. However, the set as a whole is not
conflict-free under MSTB.
Figure 13.5: Any pair ri, rj ∈ R, i ≠ j is conflicting with respect to R.
This definition of conflict implies that there may be sets of n one-dimensional ranges having O(n²) pairs of conflicting ranges. Figure 13.5 shows an example of a set R of n ranges. Here, any pair ri, rj ∈ R, i ≠ j is conflicting with respect to R. Resolve(R) = {ri ∩ rj | 1 ≤ i, j ≤ n, i ≠ j} contains n(n−1)/2 elements. Though R ∪ resolve(R) is conflict-free according to Theorem 2, it is not necessary to add a resolve filter for each pair of conflicting filters in R in order to make it conflict-free. We can show:
Lemma 3. For every set R of n one-dimensional ranges, there is a set S of O(n) one-dimensional ranges such that R ∪ S is conflict-free.
Proof. Let R = {r1 , . . . , rn }. These ranges partition the universe U into at most
2n−1 consecutive slabs defined by the endpoints of the ranges. Let ep0 , ep1 , . . . , epk
be the boundaries of these slabs. Let σ = {[epi , epi+1 ], 0 ≤ i < k}. Then R ∪ σ is
obviously conflict-free.
Hence, the trivial solution to make a set conflict-free is to add a “slab-resolve” filter
for each of the slabs. Yet, this solution possibly adds unnecessary filters since not
every slab may require a resolve filter, e.g., when the set is already conflict-free.
Therefore, our goal is to add only those slab-resolve filters that are needed to make
the set conflict-free.
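The slab construction used in the proof of Lemma 3 can be sketched as follows. The tuple representation of ranges and the function name are illustrative, not taken from the text.

```python
# Hedged sketch of the trivial construction in Lemma 3: the endpoints of the
# n ranges partition the universe into at most 2n - 1 consecutive slabs, and
# adding one "slab-resolve" filter per slab makes any set conflict-free.

def slabs(R):
    """Consecutive slabs [ep_i, ep_{i+1}] defined by the endpoints of R."""
    eps = sorted({e for (lo, hi) in R for e in (lo, hi)})
    return [(eps[i], eps[i + 1]) for i in range(len(eps) - 1)]

R = [(0, 10), (5, 15), (8, 20)]
sigma = slabs(R)
print(sigma)                        # [(0, 5), (5, 8), (8, 10), (10, 15), (15, 20)]
print(len(sigma) <= 2 * len(R) - 1) # True: at most 2n - 1 slabs
```

Adding every slab of `sigma` as a filter yields the conflict-free superset of Lemma 3; the rest of the chapter is about adding only the slabs that actually need a resolve filter.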
After presenting related work, we will propose our output-sensitive offline conflict
detection and resolution algorithm for one-dimensional range tables.
13.3 Related work
13.3.1 Online conflict detection and resolution
For a given set R of n one-dimensional nonintersecting ranges under MSTB, the
deletion of intervals maintains the conflict-free property of R. Hence, only the
Figure 13.6: Examples of partially overlapping ranges. (a) s left-overlaps r; (b) s right-overlaps r.
insertion of a new interval may become critical for the online variant of the problem.
In order to determine whether a new range r = [u, v] partially overlaps with any
of the ranges in R, two conditions have to be checked:
1. ∃s = [x, y] ∈ R : x < u ≤ y < v (s left-overlaps r, cf. Figure 13.6(a))
2. ∃s = [x, y] ∈ R : u < x ≤ v < y (s right-overlaps r, cf. Figure 13.6(b))
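The two conditions above can be written directly as predicates; a minimal sketch with illustrative names, using closed integer intervals as (lo, hi) tuples.

```python
def left_overlaps(s, r):
    """Condition (1): s = [x, y] left-overlaps r = [u, v] iff x < u <= y < v."""
    (x, y), (u, v) = s, r
    return x < u <= y < v

def right_overlaps(s, r):
    """Condition (2): s = [x, y] right-overlaps r = [u, v] iff u < x <= v < y."""
    (x, y), (u, v) = s, r
    return u < x <= v < y

r = (5, 15)
print(left_overlaps((0, 10), r))    # True: (0, 10) ends inside r
print(right_overlaps((10, 20), r))  # True: (10, 20) starts inside r
print(left_overlaps((6, 9), r))     # False: (6, 9) is nested in r, not overlapping
```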
Lu and Sahni [21] maintain two priority search trees (PST) to detect all conflicts. However, the second PST is maintained exclusively for the detection of
right-overlaps (IP lookup is performed on the first PST). The actual information contained in the second PST is redundant. The overall insertion time is in
O(log n). However, the actual time for insertion (and deletion), as well as the
space requirements of the overall approach, are increased by roughly a factor of
two. Lauer et al. [46] show that such an additional structure is not required and
that the verification of condition (2) can also be achieved in time O(log n) by a
single query on the original structure (plus one comparison operation).
If R is made up of arbitrary one-dimensional ranges under MSTB, then both
operations to insert and delete an interval may lead to conflicts in the resulting
set. In the case of an insertion, the new interval may left- or right-overlap with one
or more intervals in R, such that the overlapping range has no resolving subset.
Similarly, a conflict may arise if we remove t = r ∩ s from R, because t ∈ R is an
interval that resolves a conflict between r, s ∈ R, see Figure 13.7.
Figure 13.7: Removing t will lead to a conflict between r and s.
This online version of the problem is solved (in a rather complex way) by Lu and
Sahni [21].
Hari et al. [7] introduced the notion of filter conflict under the most specific
tiebreaking rule. Hari et al.’s motivation to apply the MSTB is that the scheme for
resolving ambiguities in classification that is based on prioritizing the filters (and
then choosing the filter with highest priority) is not able to enforce every policy.
Let each filter f be a 2-tuple (f [1], f [2]), where each field f [i] is a prefix bit string.
Assuming that the most specific tiebreaker is applied, two filters f1 and f2 have a
conflict iff [7]
1. f2 [1] is a prefix of f1 [1] and f1 [2] is a prefix of f2 [2] or
2. f1 [1] is a prefix of f2 [1] and f2 [2] is a prefix of f1 [2]
Figure 13.8 provides an example of two conflicting filters. For packets falling in
the overlap region, the most specific filter is not defined.
Figure 13.8: Filters f1 and f2 are in conflict.
Hari et al. propose a new scheme for conflict resolution, which is based on the
idea of adding a resolve filter f1 ∩ f2 for each pair f1 , f2 of conflicting filters [7].
This guarantees that the most specific tiebreaker can be employed. For packets
falling in the overlap region, the resolve filter determines the action that is to be
applied. Note that this definition of conflict disregards the fact that the overlap
region o may already be exactly covered by another filter or set of filters whose
union equals o, hence this approach may introduce unnecessary resolve filters.
In order to detect all conflicts between a new filter and a given set of filters, they
utilize two complementary data structures, one for each of the two cases listed
above. When a new filter is to be inserted, we search the first structure and report all filters that satisfy condition (1). All filters satisfying condition (2) can
be found by searching the second structure. The algorithm adds resolve filters for
each pair of conflicting filters. Conflict detection takes time O(w2 ), where w is the
width of each field. It is possible to reduce this to O(w) using switch pointers.
The drawback is that the involved precomputation raises the filter update time to
O(n), where n is the number of current filters. The authors further extend their
algorithm to three dimensions where the protocol field is restricted to be either
TCP, UDP or wildcard. In this case, the time for conflict detection as well as
the algorithm remains unchanged. The only overhead is the threefold increase in memory for filters with a wildcard protocol field. They further extend their algorithm
to five dimensions, in which case the source and destination ports are restricted to
be either fully specified or wildcard.
Al-Shaer and Hamed developed the Policy Anomaly Detector, a set of tools to
manage filtering policies [130]. It discovers conflicting filters and automatically
determines the proper order for any inserted or modified filter. It further provides
a natural language translation of low-level filters.
13.3.2
Offline conflict detection and resolution
Lu and Sahni [131] consider the two-dimensional offline problem under MSTB
for sets of prefix filters. Two filters f1 , f2 ∈ F conflict iff an edge of f1 perfectly
crosses an edge of f2 , that is, two edges perfectly cross iff they cross, and their
crossing point is not an endpoint. Note that this definition of conflict is based on
the definition by Hari et al. [7]. In other words, a proper edge intersection between
two ranges in F is a direct cause for conflict. This implies that all conflicts in F
can be detected by computing all proper intersecting pairs of filters in F . This can
be done by a slight modification of the classical sweepline algorithm for reporting
all intersections in a set of n iso-oriented line segments. Lu and Sahni [131] further
discuss the problem of resolving conflicts by adding a set of resolve filters to F .
For each conflicting prefix-pair f1 , f2 ∈ F , a new resolve filter h = f1 ∩ f2 is added
to F , cf. Figure 13.9.
Figure 13.9: Conflict-free set of prefix filters; colored regions represent the set of resolve filters.

If a resolve filter h is already in the original set of filters F, or if the original set contains a set of filters whose union equals h, then we can avoid adding h to
resolve(F). Therefore, the authors introduce the notion of an essential resolve
filter. A filter h ∈ resolve(F ) is an essential resolve filter iff F ∪ resolve(F ) − {h}
has no subset whose union equals h.
It takes O(n log n + s) time to determine resolve(F), where n is the number of filters in F and s is the number of resolve filters, and an additional O((n + s)w) time, where w is the length of the longest prefix, to identify the set of essential resolve
filters.
A set of prioritized rectangles has a conflict if there exists a query point p such
that there is no unique maximum priority rectangle containing p. The only known
result for conflict detection for prioritized ranges (not only prefix ranges) in two
dimensions is the algorithm proposed by Eppstein and Muthukrishnan [132]. Their
algorithm uses a technique related to an algorithm by Overmars and Yap [133]
devised for solving Klee’s measure problem: Given a collection of n d-dimensional
rectangles, compute the measure of their union. Overmars and Yap proposed an O(n^{d/2} log n) time and O(n√n) space sweepline algorithm for d ≥ 3. In three dimensions, their algorithm uses a generalization of a 2-d-tree, a two-dimensional orthogonal partition tree, which defines a subdivision of the plane into rectangular cells. In two dimensions, the partition created has the properties that there are O(n) cells, that no rectangle is contained in more than O(√n) cells, and that no more than O(√n) rectangles need to be examined within each cell. The cells are stored in the leaves of the partition tree. Each cell has the form of a trellis, not containing vertices in its interior, cf. Figure 13.10.

Figure 13.10: A trellis.

The measure of each trellis can be easily computed. The overall measure is computed using the information
that is maintained in the partition tree under insertions and deletions of rectangles
during the sweep.
Eppstein and Muthukrishnan construct a 2-d-tree of the rectangle vertices to divide
the plane into rectangular cells not containing any rectangle vertex. Then, they
perform a depth first traversal of the tree. However, this scheme only yields a yes
or no answer to the offline version of the conflict detection problem and does not
report and resolve the conflicts. Their algorithm runs in time O(n^{3/2}) and uses
linear space.
Chapter 14. Detecting and Resolving Conflicts
In the following we present our solution for an offline conflict detection and resolution algorithm for one-dimensional range tables [129, 134]. The algorithm is based
on the sweepline technique, achieves a worst case time complexity of O(n log n)
and uses O(n) space, where n is the number of filters in the set. The algorithm is
output-sensitive in the sense that it reports only essential resolve filters.
14.1 The output-sensitive solution to the one-dimensional offline problem
Let R be a set of n one-dimensional arbitrary range filters under MSTB. The left
and right endpoints of the filters in R partition the discrete universe into at most
2n − 1 slabs. Each distinct endpoint makes up the boundary between two slabs
and is placed in a linearly ordered set to form the event points of the sweepline
paradigm.
Definition 9. A slab σi is a single partition of the set R containing a range
of points from the discrete universe between two event points epi and epi+1 , where
σi = [epi , epi+1 ). All points p ∈ σi stab the same subset Si ⊆ R and are represented
collectively by epi .
Definition 10. Slab σi is conflict-free under MSTB iff there is a shortest filter r ∈ Si that contains epi, such that r is contained in all other filters s ∈ Si stabbed by epi.
If the shortest (most specific) filter r contains epi , then it contains all points p in
slab σi . Thus, we detect a conflict situation at epi if the shortest filter r is not
contained in at least one s = [s.lo, s.hi] ∈ Si .
Definition 11. Slab σi is non-conflict-free under MSTB iff the shortest filter r is
not contained in at least one s ∈ Si .
Lemma 4. Let σi be a non-conflict-free slab. Then σi requires only a single
“slab-resolve” filter hσi = [epi , epi+1 ) to make it conflict-free.
Proof. If σi is not conflict-free, then the shortest filter r stabbed by epi is not
contained in all other filters in Si . Let hσi be a resolve filter that spans σi . Then
for the same epi, ||hσi|| < ||r||, so that hσi is now the shortest filter that contains epi and is contained in all filters in Si (including r).
Definition 12. A filter r = [r.lo, r.hi] left-overlaps filter s = [s.lo, s.hi] iff r.lo <
s.lo ≤ r.hi < s.hi.
Definition 13. A filter r right-overlaps s iff s.lo < r.lo ≤ s.hi < r.hi.
Corollary 1. Slab σi requires hσi if ∃s ∈ Si that left-overlaps r and s.hi > epi .
Corollary 2. Slab σi requires hσi if ∃s ∈ Si that right-overlaps r and s.lo ≤ epi .
Theorem 3. Let SlabResolve(R) be the set obtained from R by adding a slab-resolve hσi for every non-conflict-free slab σi. Then R ∪ SlabResolve(R) is conflict-free.
Proof. After adding a slab-resolve hσi for every non-conflict-free slab σi , there
exists a most specific filter in every single slab. Hence, by Definition 2, the set
R ∪ SlabResolve(R) is conflict-free.
Therefore, it is sufficient to (i) determine the smallest filter r among all filters in Si stabbed by epi, and then (ii) check that r is contained in all other filters in Si to deduce that slab σi is conflict-free.
14.1.1 Status structures
A filter s ∈ R belongs to the status structure T at epi iff epi ∈ [s.lo, s.hi), and
all such filters are stored in a collective structure ordLen(T ), which orders the
filters according to ascending lengths. T can be split into two distinct subsets,
so that T = new(T ) ∪ current(T ), where s ∈ new(T ) if s.lo = epi , otherwise,
s ∈ current(T ).
The filters in new(T) are ordered in a single structure in ascending order of their lengths. The filters in the generic set current(T), on the other hand, are maintained in two separate red-black trees: (i) curR(T), which orders all s by ascending hi-endpoints, and (ii) curL(T), which orders all s by descending lo-endpoints. Furthermore, if current(T) ≠ ∅, then let lhp and hlp be pointers to the lowest of all the hi-endpoints and the highest of all the lo-endpoints in current(T). Respectively, lhp and hlp point to the top of the ordered lists in curR(T) and curL(T).
14.1.2 Handling event points
At each event point epi , we take the shortest filter r ∈ T from the top of the list
in ordLen(T ) and check if r ∈ new(T ), otherwise, r ∈ current(T ). Here, there are
two cases to consider:
Case I: The shortest filter r ∈ new(T ).
Case II: The shortest filter r ∈ current(T ).
Theorem 4 (Case I). If r ∈ new(T) and current(T) = ∅, then slab σi is conflict-free.
Proof. At epi all s ∈ new(T ) have the same lo-endpoints, and since r ∈ new(T ),
r is completely contained in all s ∈ new(T ).
Theorem 5 (Case I). If r ∈ new(T ) and lhp ≥ r.hi, then slab σi is conflict-free.
Proof. From Theorem 4, r cannot conflict with any s ∈ new(T). Also, by Definition 9, slab σi cannot span any longer than ||r||¹, which then makes it unnecessary to consider the case lhp ≥ r.hi in slab σi.
Theorem 6 (Case I). If r ∈ new(T ) and lhp < r.hi, then ∃ at least one filter
s ∈ current(T ) that left-overlaps r, so that slab σi is no longer conflict-free.
Proof. Straightforward from Corollary 1.
Theorem 7 (Case II). If r ∈ current(T) and new(T) ≠ ∅, then every s ∈ new(T) conflicts with r, so that slab σi is no longer conflict-free.
Proof. Straightforward from Corollary 2; every s ∈ new(T) right-overlaps r.
Theorem 8 (Case II). If r ∈ current(T) and new(T) = ∅, then slab σi is conflict-free iff hlp ≤ r.lo and r.hi ≤ lhp.
Proof. It follows from Definition 10 that r is contained in all s ∈ current(T ) =
T.
¹ However, ||σi|| can be shorter than ||r|| if there exists an epi+1 that comes before r.hi.
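The case analysis of Theorems 4-8 can be condensed into a single decision routine. The sketch below replaces the red-black trees and ordLen(T) with plain lists and recomputes lhp and hlp on the fly, so it is illustrative rather than the O(log n)-per-event implementation described in the text; all names are our own.

```python
def slab_needs_resolve(new, current, ep):
    """Return True iff the slab starting at event point ep is non-conflict-free.

    `new` holds filters (lo, hi) with lo == ep; `current` holds filters that
    started earlier but still contain ep."""
    T = new + current
    assert all(lo <= ep <= hi for (lo, hi) in T)  # every filter is stabbed by ep
    r = min(T, key=lambda f: f[1] - f[0])         # shortest (most specific) filter
    if r in new:
        if not current:
            return False                          # Theorem 4: only new filters
        lhp = min(hi for (lo, hi) in current)     # lowest hi-endpoint
        return lhp < r[1]                         # Theorems 5 and 6
    if new:
        return True                               # Theorem 7: new filters right-overlap r
    lhp = min(hi for (lo, hi) in current)
    hlp = max(lo for (lo, hi) in current)         # highest lo-endpoint
    return not (hlp <= r[0] and r[1] <= lhp)      # Theorem 8

# A left-overlap conflict (Theorem 6): (0, 8) ends inside the new shortest filter (5, 9).
print(slab_needs_resolve(new=[(5, 9)], current=[(0, 8)], ep=5))  # True
print(slab_needs_resolve(new=[(5, 9)], current=[], ep=5))        # False
```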
14.1.3 The sweepline environment
We conceptually extend all intervals by half the size of the distance of points in
the discrete raster U , such that all stabbing queries become queries for points
falling into the interior of the intervals. Figure 14.1 shows this transformation
on filters r, s, and t, such that for any range filter x there is a mapping function f : x = [x.lo, x.hi] ↦ x′ = [x.lo − 0.5, x.hi + 0.5]. This is important for the Slab-Detect
algorithm in order to detect all conflicts. For example, the point epm = 19 in
Figure 14.1(a) stabs both filters r and s. If the filters are not mapped as above,
then only s exists in the status structure at epm , and the conflict between r and s
is not detected. Also, we will miss t completely at epn = 21. Figure 14.1(b) shows
the solution to these problems, and, as a consequence, all event points epi lie on
the 0.5 mark.
Figure 14.1: Extending each filter in (a) by half the size of the distance of points
so as not to miss a crucial event point during the sweep in Slab-Detect.
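A minimal sketch of the half-step mapping; the function name and the example endpoints are our own.

```python
def extend(f):
    """Map a filter [lo, hi] on the integer raster to [lo - 0.5, hi + 0.5],
    so that every stabbing query falls into the open interior of the interval
    and no event point is missed during the sweep."""
    lo, hi = f
    return (lo - 0.5, hi + 0.5)

r, s = (10, 19), (19, 25)
# On the raster, point 19 stabs both r and s; after the mapping, both
# extended intervals strictly contain 19.
er, es = extend(r), extend(s)
print(er)                                          # (9.5, 19.5)
print(er[0] < 19 < er[1] and es[0] < 19 < es[1])   # True
```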
14.1.4 Running Slab-Detect
Figure 14.2 illustrates the Slab-Detect algorithm on a given set R = {r, s, t, u, v, w},
where R partitions the discrete universe into 11 slabs, separated by ten distinct
event points. Sweeping from left to right, Slab-Detect maintains the status structure T = new(T ) ∪ current(T ) at every event point epi , determines the shortest
filter r∗ and notes which subset of T it comes from, and then reports whether or
not slab σi requires a resolve filter. In the given example, Slab-Detect finds that
the non-conflict-free slabs σ3 , σ6 , σ7 , and σ8 require resolve filters.
Corollary 3. Let g and h be two adjacent resolve filters reported by the Slab-Detect
algorithm that span two consecutive slabs σi and σi+1 . Then g and h cannot be
merged.
Proof. All adjacent event points on the sweepline contain a unique set of filters in
new(T ) and current(T ); that is, no two consecutive event points epi and epi+1 have
the same elements in the status structure T. Thus it follows that the conditions that make the filters in slab σi conflict are different from the conflict conditions in σi+1. Therefore, g and h cannot be merged.

Figure 14.2: Running from left to right, the Slab-Detect algorithm reports four resolves required for the non-conflict-free slabs σ3, σ6, σ7, and σ8.
Slab-Detect does not report duplicate resolve filters. Figure 14.3 shows an example.
Figure 14.3: Slab-Detect: duplicate resolve filters are not reported.
Further, as mentioned in section 13.3, Lu and Sahni [131] introduced the notion
of an essential resolve filter. Their solution reports all resolve filters at first, and
then, in a second step, identifies the essential resolve filters. Slab-Detect reports
the essential resolve filters ab initio.
14.2 Experimental results
The input data, i.e., the ranges, are randomly generated depending on two main
parameters: the total number n of filters to generate, and the bit-length w of a
filter field. An arbitrary range filter r = [r.lo, r.hi] is formed using the function
random(i, j) that generates a random integer that is normally distributed between
i and j, and is described as follows:
r.lo = random(0, 2^w − 1),
r.hi = random(r.lo, 2^w − 1)
We add variations of discrepancy into the generated set R by introducing an additional parameter percentPrefix, which determines the percentage of one-dimensional prefix filters in R. That is, by setting percentPrefix = 1.0, we
generate n non-conflicting prefix filters in R. By varying this parameter between
0.0 and 1.0, we can indirectly control the number of conflicting pairs of filters generated by the base set R, and in doing so, test the Slab-Detect algorithm on its
output-sensitivity.
A prefix filter is described by f = [b/prefixLen], where b is a random bit-pattern, and prefixLen is the number of prefix bits to retain in b. We generate it as follows:

b = random(0, 2^w − 1),
prefixLen = random(0, w − 1)
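A sketch of this generator in Python. We substitute the uniform `random.randint` for the normally distributed generator of the text, and the expansion of a prefix b/prefixLen into its covered range is our own illustrative addition.

```python
import random

def random_range_filter(w):
    """Arbitrary range filter: lo = random(0, 2^w - 1), hi = random(lo, 2^w - 1)."""
    lo = random.randint(0, 2**w - 1)
    hi = random.randint(lo, 2**w - 1)
    return (lo, hi)

def random_prefix_filter(w):
    """Prefix filter b/prefixLen, returned as the range it covers."""
    b = random.randint(0, 2**w - 1)
    prefix_len = random.randint(0, w - 1)
    # A prefix covers the range obtained by filling the remaining
    # w - prefix_len bits with all zeros (lo) and all ones (hi).
    mask = (1 << (w - prefix_len)) - 1
    return (b & ~mask, b | mask)

random.seed(42)  # arbitrary seed, for reproducibility
lo, hi = random_range_filter(32)
print(0 <= lo <= hi <= 2**32 - 1)            # True: a valid range filter
plo, phi = random_prefix_filter(32)
print((phi - plo + 1) & (phi - plo) == 0)    # True: prefix range length is a power of two
```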
All of our simulations are performed on a Pentium IV, 3GHz machine with 2GBytes
RAM running on Java 5.0. We generate arbitrary range filters for w = 128 with
various percentages of prefix filters in steps of 10% increments, for sample sizes
|R| = 5K, 10K, 20K, 30K, 40K, and 50K. At each epoch of one sample size |R| and
one preset value of percentP ref ix, we note the total runtime, the resolve filters
reported, and the memory consumption. We repeat each epoch 20 times, each time
generating a new set of random samples for R, and then calculate the averages for
the noted values above. We benchmark Slab-Detect against a naïve algorithm which reports resolve filters for all pairwise conflicting filters following Definition 8, and we term this set “resolve(R)”.
Figure 14.4 shows our simulation results for |R| = 5K and w = 128. We see in
Figure 14.4(a) that the number of reported slab-resolves, which we collectively term
essential(R), is never more than the number of slabs generated by the filters in R.
Also, the total runtime for Slab-Detect decreases as the number of essential(R)
decreases, which is a consequence of increasing the percentage of prefix filters in the
sample R. This shows that Slab-Detect is indeed output-sensitive, as the amount
of time taken to report all necessary resolve filters in the set R is proportional
to the number of non-conflict-free slabs in R. The total runtime of the naïve algorithm, in contrast, is unaffected by the number of conflicting pairs within the sample R, as shown in Figure 14.4(b). From both these figures, we see that Slab-Detect reports far fewer resolve filters than its naïve opponent. Further, the runtime of Slab-Detect is up to one order of magnitude faster than that of the naïve algorithm. Note that the time reported in Figure 14.4(b) is the time taken by the naïve algorithm to report the raw set resolve(R), and it will take additional time to remove all unnecessary filters in resolve(R). Slab-Detect, however, does not report duplicate resolve filters ab initio.
In Figure 14.4(c), we report the average and maximum memory requirements to handle Slab-Detect for |R| = 5K and w = 128. The figure also shows the average amount of time taken to handle and process a single event point within Slab-Detect, in accordance with the various percentage mixes of prefix filters in the set R.
The complete round of simulation outputs for the Slab-Detect algorithm on |R| =
10K, 20K, 30K, 40K, and 50K are summarized in the graphs shown in Figure 14.5.
Figure 14.4: Simulation results for Slab-Detect and Naïve. |R| = 5K, w = 128.
14.3 Adapting Slab-Detect under the HPF rule
In this section, we show that the original concepts of the Slab-Detect algorithm
under MSTB are adaptable for use under the highest-priority tiebreaking rule
(HPF). It also achieves the same runtime performance and space complexity as discussed in Section 14.2. Under HPF, each filter r ∈ R is assigned a priority value
prio(r).
Definition 14. A filter r has a higher priority than filter s iff prio(r) < prio(s).
Definition 15. A set of filters R is conflict-free under HPF iff for each point p
there is a unique filter r of the highest priority that contains p, such that prio(r) <
prio(s), ∀s ∈ S ⊆ R stabbed by p.
Figure 14.5: Slab-Detect: Simulation results for |R| = 10K, 20K, 30K, 40K, 50K, and w = 128. (a) Slab-Detect runtime performance; (b) reporting slabResolve(R); (c) Slab-Detect memory requirements.
Definition 16. Slab σi is conflict-free under HPF iff there is a highest priority
filter r ∈ Si that contains epi , such that prio(r) < prio(s), ∀s ∈ Si stabbed by epi .
14.3.1 Status structures
A filter s ∈ R belongs to the status structure T at epi iff epi ∈ [s.lo, s.hi). All such filters are stored in a single red-black tree, ordPrio(T), which orders the filters according to ascending priorities.
14.3.2 Handling event points
At each event point epi, we query the top two filters r0 and r1 in ordPrio(T).
Corollary 4. If prio(r0 ) 6= prio(r1 ), then slab σi is conflict-free.
Proof. In this case filter r0, which contains epi , has strictly higher priority than all other filters in ordPrio(T) stabbed by epi . If, on the other hand, prio(r0) = prio(r1), then slab σi is not conflict-free and requires a resolve filter hσi with prio(hσi) < prio(r0).
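The event-point test of Corollary 4 can be sketched as follows. In the thesis, ordPrio(T) is a red-black tree; this illustrative version uses a sorted list instead, and the class and method names are hypothetical:

```python
import bisect

class OrdPrio:
    """Status structure holding the filters active at the current event
    point, ordered by ascending priority value (smallest value = highest
    priority). A red-black tree in the thesis; a sorted list here."""

    def __init__(self):
        self._prios = []  # ascending priority values of active filters

    def insert(self, prio):
        bisect.insort(self._prios, prio)

    def delete(self, prio):
        self._prios.pop(bisect.bisect_left(self._prios, prio))

    def slab_is_conflict_free(self):
        # Corollary 4: compare the two highest-priority filters
        if len(self._prios) < 2:
            return True
        return self._prios[0] != self._prios[1]
```

When `slab_is_conflict_free()` returns False, a resolve filter with a priority value smaller than the current top priority would be required for the slab.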
Corollary 5. If two adjacent slabs σi and σi+1 require resolve filters g and h
respectively, then under HPF g and h can be merged. Assign prio(g ∪ h) = prio(g),
if prio(g) < prio(h); otherwise assign prio(g ∪ h) = prio(h).
Proof. In both σi and σi+1 , g ∪ h is the highest priority resolve filter stabbed by
epi and epi+1 .
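Corollary 5 amounts to taking the union of the two adjacent slabs and the minimum of the two priority values. A minimal sketch, where the (lo, hi, prio) representation of a resolve filter is an assumption for illustration:

```python
def merge_resolve_filters(g, h):
    """g and h are resolve filters for adjacent slabs, given as
    (lo, hi, prio) with g's right boundary equal to h's left boundary.
    The merged filter covers both slabs; per Corollary 5 it takes the
    smaller priority value (i.e. the higher of the two priorities)."""
    g_lo, g_hi, g_prio = g
    h_lo, h_hi, h_prio = h
    assert g_hi == h_lo, "slabs must be adjacent"
    return (g_lo, h_hi, min(g_prio, h_prio))
```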
14.4 Setting up IP lookup with Slab-Detect
In the following we discuss IP lookup under MSTB, utilising the structure ordLen(T) introduced in Section 14.1. Alternatively, we can substitute ordLen(T) with ordPrio(T), introduced in Section 14.3, to allow IP lookup under HPF.
Think of the x-axis as a timeline. Note that the sets of line segments intersecting contiguous slabs are similar: as the boundary from one slab to the next is crossed, certain segments are deleted from the set and other segments are inserted. Over the entire time range, there are 2n insertions and deletions, one insertion and one deletion per segment.
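Viewed on the timeline, the sweep thus processes exactly 2n event points. A small sketch of how such an event list could be generated; the representation is assumed for illustration, not taken from the thesis:

```python
def sweep_events(segments):
    """Each segment (lo, hi) contributes one insertion event at lo and
    one deletion event at hi, so n segments yield 2n events in total."""
    events = []
    for lo, hi in segments:
        events.append((lo, 'insert', (lo, hi)))
        events.append((hi, 'delete', (lo, hi)))
    events.sort()  # process event points in ascending x order
    return events
```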
Ordinary data structures are ephemeral in the sense that an update on the structure destroys the old version, leaving only the new version available for use. A data
structure is called partially persistent if all intermediate versions can be accessed,
but only the newest version can be modified, and fully persistent if every version
can be both accessed and modified. The obvious way to provide persistence is to
make a copy of the data structure each time it is changed. Refer to Driscoll et
al. [135] for a systematic study of persistence.
The idea is to maintain a data structure during Slab-Detect’s sweep that stores
for each slab the segments that cover the slab and also the resolve filter if the slab
was found to be non-conflict-free, cf. Figure 14.6.
Figure 14.6: Slab-Detect and the partially persistent version of ordLen(T).
After Slab-Detect completes its full run, let T^p refer to the partially persistent version of ordLen(T). Let R be a set of n one-dimensional range filters and p be an incoming packet to be classified. Version x of T^p consists of the intervals in R that intersect the line x = epi . For a stabbing point p, we search the highest version less than p. Therefore, we need an auxiliary data structure to store the access pointers to the various versions. When the pointers are stored in a balanced binary search tree, initiating access into any version takes O(log n) time. We know that the intervals in each version of T^p are ordered with respect to their ascending lengths. Hence, the shortest (most specific) filter in each version can be reported in O(1) time. IP lookup can thus be performed in O(log n) time.
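The lookup just described can be sketched as follows. In the thesis, the versions share structure through partial persistence; in this illustrative version each version is simply materialised as a tuple of filters sorted by ascending length, and the version with the largest event point not exceeding p is chosen:

```python
import bisect

def lookup(versions, p):
    """versions: list of (event_point, filters) with event points ascending
    and each filters tuple sorted by ascending length (most specific first).
    Returns the most specific filter applying to p, or None."""
    keys = [ep for ep, _ in versions]
    i = bisect.bisect_right(keys, p) - 1  # highest version with key <= p
    if i < 0:
        return None  # p lies before the first event point
    filters = versions[i][1]
    return filters[0] if filters else None  # shortest filter, O(1)
```

The binary search over the version keys corresponds to the O(log n) access step; reporting the shortest filter of the selected version is constant time.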
Mohamed, Langner and Ottmann [136] propose path-merging as a refinement of
techniques used to make linked data structures partially persistent. Path-merging
supports bursts of operations between any two adjacent versions in contrast to only
one operation in the original variants. Utilizing the path-merging technique we are
able to solve the conflict detection problem in O(n log n) time while building the partially persistent structure, and then utilize it to answer lookup queries in O(log n) time.
14.5 Contributions and concluding remarks
We presented an output-sensitive sweepline algorithm to make a given set of one-dimensional arbitrary range filters conflict-free. The scheme achieves a worst case time complexity of O(n log n) and uses O(n) space, where n is the number of filters in the set. The number of reported resolve filters is not always minimal, yet the measure of their union is minimal among all sets of resolve filters that make the set conflict-free. For example, consider the filter set in Figure 14.7. Slab-Detect reports the two resolve filters d and e to make the original set a, b, c conflict-free, even though the single resolve filter a ∩ c would be sufficient. Yet the measure of the union of d and e is smaller than that of a ∩ c.
Figure 14.7: Slab-Detect reports the two resolve filters d and e in order to make
the original set a, b, c conflict-free.
In summary, Slab-Detect:
• does not report duplicate resolve filters
• only reports essential resolve filters (at most O(n))
• supports IP lookup with an augmentation
• is adaptable for use under the highest-priority filter rule (HPF), with the same runtime performance and space complexity as under MSTB
This work is unique in the sense that there is no similar previous work against which to benchmark our results. All the literature we came across either deals exclusively with arbitrary range filters in the online variant or, where the authors deal with the offline variant, involves data sets of prefix filters in higher dimensions.
Part IV
Summary of Contributions
Summary of Contributions and Future Directions
However impenetrable it seems, if you don’t try it, then you can never
do it.
Andrew Wiles
A dynamic routing protocol adjusts to changing network topologies, which are indicated in update messages that are exchanged between routers. If a link attached
to a router goes down or becomes congested, the routing protocol makes sure that
other routers know about the change. From these updates a router constructs a
forwarding table which contains a set of network addresses and a reference to the
interface that leads to that network. In a turbulent period, one or a few major routing events cause several routes to be updated simultaneously.
Since the performance of the lookup device plays a crucial role in the overall performance of the Internet, it is important that lookup and update operations are
performed as fast as possible. In order to accelerate the operations, routing tables
must be implemented in a way that they can be queried and modified concurrently
by several processes. Relaxed balancing has become a commonly used concept in
the design of concurrent search tree algorithms. In relaxed balanced data structures, rebalancing is uncoupled from updates and may be arbitrarily delayed.
The first part of this dissertation proposed the relaxed balanced min-augmented
range tree and presented an experimental comparison with the strictly balanced
min-augmented range tree in a concurrent environment. The benchmark results
confirmed the hypothesis that the relaxed balanced min-augmented range tree is
better suited for the representation of dynamic IP router tables than the strictly
balanced version of the tree. The higher the update frequency, and (up to a certain point) the higher the number of processes, the more clearly the relaxed version outperformed the standard version. These results were presented at the 13th IEEE
Symposium on Computers and Communications (ISCC 2008) [137].
Further, we presented an interactive visualization of the relaxed balanced MART
which corroborated the correctness of the proposed locking schemes.
The continuous growth of network link rates poses a grand challenge for high-speed IP lookup engines. Given such high data rates, IP lookup must be implemented in hardware. Duchene and Hanna [94] proposed a technique to generate flowpaths
directly from Java bytecode representing multithreaded Java programs. It would be interesting to investigate RMART's performance when implemented directly in hardware.
Modern IP routers further provide policy-based routing (PBR) mechanisms. PBR
provides a technique for expressing routing criteria based on the policies defined
by the network administrators, which complements the existing destination-based
routing scheme. PBR requires network routers to examine multiple fields of a packet header in order to categorize it into the appropriate “flow”. Flow identification entails searching a table of predefined rules to identify the appropriate
flow based on criteria including IP address, port, and protocol type. Packet classification enables network routers to provide advanced network services including
network security, quality of service (QoS) routing, and monitoring.
The second part investigated whether the popular R*-tree is suited for packet classification. To this end it was benchmarked against two representative classification
algorithms using the ClassBench tools suite. According to our benchmark results,
R*-trees can be considered an alternative solution for the static packet classification problem for Firewall (FW) filter sets. The contributions presented in the
second part of this thesis were presented at the Seventh IEEE International Symposium on Network Computing and Applications (NCA 2008) [138].
Scenarios where policies need to be updated include network state changes (e.g., link failures) along the policy-based routes or dynamic topologies (ad hoc networks). Most existing packet classification solutions do not support (fast) incremental updates. A further strength of the R*-tree is its support of dynamic incremental updates. Hence it would be interesting to investigate the performance of
R*-trees in a dynamic classification environment. To this end, the PALAC simulator could be employed. PALAC is a packet lookup and classification simulator that
was designed by Gupta and Balkman and is freely available for public use [139].
The simulator provides facilities for traffic generation as well as classifier updates
generation. Updates are interleaved with packet lookups/classifications during a
simulation. PALAC outputs a variety of statistics including algorithm storage,
worst case as well as average classification time, and the number of dropped packets. The simulator further provides a repository of algorithms which currently
contains Linear Search, Trie Search and Heap-on-Trie Search. Each algorithm is
subclassed from a generic class. If a user wants to evaluate a new algorithm, it
must be subclassed from the generic class and integrated into the PALAC architecture. The simulator is implemented in C++.
Further, additional research is required to evaluate R*-tree’s performance when
implemented directly in hardware.
The header of an arriving packet may match more than one filter, in which case the
filter with the highest priority among all the matching filters is commonly chosen
as the best matching filter. Applying the highest-priority tiebreaker resolves this ambiguity in the classification process. Yet not every policy can be implemented by prioritizing the filters. If the most specific tiebreaker, analogous to the most specific tiebreaker (MSTB) in one-dimensional IP lookup, is to be deployed, it must be ensured that for each packet p there is a well-defined most specific filter that applies to p. A seminal technique adds so-called resolve filters for each pair of conflicting filters, which guarantees that the most specific tiebreaker can be applied.
The third part of this dissertation presented a conflict detection and resolution
scheme for static one-dimensional range tables. The proposed algorithm achieves
a worst case time complexity of O(n log n) and reports only O(n) resolve filters
to make a given set of n one-dimensional arbitrary range filters in a router table
conflict-free, under both MSTB and HPF tiebreakers. Further, we have shown that
by making use of partial persistence, the structure also supports IP lookup. These
contributions were presented at the 26th Annual IEEE Conference on Computer
Communications (INFOCOM 2007) [129].
An overview of the contributions presented in part I and part III will be published
in [140].
Bibliography
[1] T. Sheldon, McGraw-Hill’s Encyclopedia of Networking and Telecommunications. McGraw-Hill Professional, 2001.
[2] (2008) The IPv6 Portal. [Online]. Available: http://www.ipv6tf.org/index.php?page=meet/history
[3] (2008) IPv4 Address Report. [Online]. Available: http://www.potaroo.net/tools/ipv4/
[4] (2008) Classless Inter-Domain Routing. [Online]. Available: http://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing
[5] (2008) Policy-based routing. [Online]. Available: http://www.cisco.com/warp/public/732/Tech/plicy_wp.htm
[6] S. Hanke, “The performance of concurrent red-black tree algorithms,” Lecture Notes in Computer Science, vol. 1668, pp. 286–300, 1999.
[7] A. Hari, S. Suri, and G. Parulkar, “Detecting and resolving packet filter
conflicts,” in INFOCOM 2000: Proceedings of the Nineteenth Annual Joint
Conference of the IEEE Computer and Communications Societies. IEEE
Press, 2000, pp. 1203–1212.
[8] (2008) BGP Update Reports. [Online]. Available: http://bgp.potaroo.net/index-upd.html
[9] A. Datta and T. Ottmann, “A note on the IP table lookup problem,” Dec.
2004, unpublished.
[10] L. Guibas and R. Sedgewick, “A dichromatic framework for balanced trees,”
in Proceedings of the 19th Annual Symposium on Foundations of Computer
Science, 1978, pp. 8–21.
[11] J. L. W. Kessels, “On-the-fly optimization of data structures,” Commun.
ACM, vol. 26, no. 11, pp. 895–901, 1983.
[12] T. Ottmann and E. Soisalon-Soininen, “Relaxed balancing made simple,” Institut für Informatik, Albert-Ludwigs-Universität Freiburg, Tech. Rep. 71, Jan. 1995. [Online]. Available: ftp://ftp.informatik.uni-freiburg.de/documents/reports/report71/
[13] Larsen and Fagerberg, “B-trees with relaxed balance,” in IPPS: 9th
International Parallel Processing Symposium. IEEE Computer Society
Press, 1995. [Online]. Available: citeseer.ist.psu.edu/larsen95btrees.html
[14] S. Hanke, T. Ottmann, and E. Soisalon-Soininen, “Relaxed balanced red-black trees,” in CIAC ’97: Proceedings of the Third Italian Conference on Algorithms and Complexity. London, UK: Springer Verlag, 1997, pp. 193–204.
[15] K. S. Larsen, T. Ottmann, and E. Soisalon-Soininen, “Relaxed balance for
search trees with local rebalancing,” Acta Informatica, vol. 37, no. 10, pp.
743–763, 2001. [Online]. Available: citeseer.ist.psu.edu/larsen97relaxed.html
[16] K. S. Larsen, “Relaxed red-black trees with group updates,” Acta Informatica, vol. 38, no. 8, pp. 565–586, 2002.
[17] L. Malmi and E. Soisalon-Soininen, “Group updates for relaxed height-balanced trees,” in PODS ’99: Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. New York, NY, USA: ACM Press, 1999, pp. 358–367.
[18] K. S. Larsen, “Relaxed multi-way trees with group updates,” in PODS ’01:
Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium
on Principles of database systems. New York, NY, USA: ACM Press, 2001,
pp. 93–101.
[19] T. Seddig, “Balancierung von Datenstrukturen zur Lösung des
Paketklassifizierung-Problems,” Diplomarbeit, Institut für Informatik,
Albert-Ludwigs-Universität Freiburg, Apr. 2006.
[20] W. Wittmann, “Nebenläufige RBRSMAB Prozesse und ihre Visualisierung,” Bachelorarbeit, Institut für Informatik, Albert-Ludwigs-Universität Freiburg, July 2008.
[21] H. Lu and S. Sahni, “O(log n) dynamic router-tables for prefixes and ranges,” IEEE Transactions on Computers, vol. 53, no. 10, pp. 1217–1230, 2004.
[22] E. M. McCreight, “Priority search trees.” SIAM J. Comput., vol. 14, no. 2,
pp. 257–276, 1985.
[23] D. E. Knuth, The Art of Computer Programming. Volume 3 Sorting and
Searching. Addison-Wesley, 1998.
[24] M. Ruiz-Sanchez, E. Biersack, and W. Dabbous, “Survey and taxonomy of
IP address lookup algorithms,” Network, IEEE, vol. 15, no. 2, pp. 8–23,
2001.
[25] V. Srinivasan and G. Varghese, “Faster IP lookups using controlled prefix
expansion,” in SIGMETRICS ’98/PERFORMANCE ’98: Proceedings of the
1998 ACM SIGMETRICS Joint International Conference on Measurement
and Modeling of Computer Systems. New York, NY, USA: ACM Press,
1998, pp. 1–10.
[26] H. Song, J. Turner, and J. Lockwood, “Shape shifting tries for faster IP
route lookup,” in ICNP ’05: Proceedings of the 13th IEEE International
Conference on Network Protocols (ICNP’05). Washington, DC, USA: IEEE
Computer Society, 2005, pp. 358–367.
[27] W. Lu and S. Sahni, “Recursively partitioned static IP router-tables,” in 12th
IEEE Symposium on Computers and Communications, 2007, pp. 437–442.
[28] W. Lu and S. Sahni, “Succinct representation of static packet classifiers,”
in 12th IEEE Symposium on Computers and Communications, 2007, pp.
1119–1124.
[29] I. Lee, K. Park, Y. Choi, and S. K. Chung, “A simple and scalable algorithm
for the IP address lookup problem,” Fundamenta Informaticae, vol. 56, no.
1,2, pp. 181–190, 2003.
[30] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf, Computational Geometry: Algorithms and Applications. Springer Verlag, 2000.
[31] H. Lu and S. Sahni, “Enhanced interval trees for dynamic IP router-tables,”
IEEE Transactions on Computers, vol. 53, no. 12, pp. 1615–1628, 2004.
[32] B. Lampson, V. Srinivasan, and G. Varghese, “IP lookups using multiway
and multicolumn search,” IEEE/ACM Trans. Netw., vol. 7, no. 3, pp. 324–
334, 1999.
[33] P. Warkhede, S. Suri, and G. Varghese, “Multiway range trees: scalable IP
lookup with fast updates,” Computer Networks, vol. 44, no. 3, pp. 289–303,
2004.
[34] S. Dharmapurikar, P. Krishnamurthy, and D. E. Taylor, “Longest prefix
matching using bloom filters,” in SIGCOMM ’03: Proceedings of the 2003
conference on Applications, technologies, architectures, and protocols for
computer communications. New York, NY, USA: ACM, 2003, pp. 201–
212.
[35] A. J. McAuley and P. Francis, “Fast routing table lookup using
CAMs,” in INFOCOM (3), 1993, pp. 1382–1391. [Online]. Available:
citeseer.ist.psu.edu/mcauley93fast.html
[36] H. Song and J. S. Turner, “Fast filter updates for packet classification using
TCAM,” in GLOBECOM. IEEE, 2006.
[37] S. de Silva, “6500 FIB Forwarding Capacities,” Presentation, Cisco Systems, Inc., 2007. [Online]. Available: http://www.nanog.org/mtg-0702/presentations/fib-desilva.pdf
[38] D. E. Taylor, “Survey and taxonomy of packet classification techniques,”
ACM Comput. Surv., vol. 37, no. 3, pp. 238–275, 2005.
[39] “Efficient Scaling for Multiservice Networks,” White Paper, Juniper Networks, July 2008. [Online]. Available: http://www.juniper.net/solutions/literature/white_papers/200207.pdf
[40] F. Zane, G. Narlikar, and A. Basu, “CoolCAMs: Power-Efficient TCAMs
for Forwarding Engines,” in Proceeding of IEEE INFOCOM ’03, 2003.
[41] E. Spitznagel, D. Taylor, and J. Turner, “Packet classification using Extended TCAMs,” in Proceedings of IEEE International Conference on Network Protocols (ICNP), 2003.
[42] W. Eatherton, G. Varghese, and Z. Dittia, “Tree bitmap: hardware/software
IP lookups with incremental updates,” SIGCOMM Comput. Commun. Rev.,
vol. 34, no. 2, pp. 97–122, 2004.
[43] R. Zemach, “CRS-1 overview,” Presentation, Cisco Systems, Inc. [Online]. Available: www.cs.ucsd.edu/~varghese/crs1.ppt
[44] K. S. Kim and S. Sahni, “Efficient construction of pipelined multibit-trie
router-tables,” IEEE Transactions on Computers, vol. 56, no. 1, pp. 32–43,
2007.
[45] W. Jiang, Q. Wang, and V. Prasanna, “Beyond TCAMs: An SRAM-Based
Parallel Multi-Pipeline Architecture for Terabit IP Lookup,” INFOCOM
2008. The 27th Conference on Computer Communications. IEEE, pp. 1786–
1794, April 2008.
[46] T. Lauer, T. Ottmann, and A. Datta, “Update-efficient data structures for
dynamic IP router tables,” International Journal of Foundations of Computer Science, vol. 18, no. 1, pp. 139–161, 2007.
[47] T. Lauer, “Potentials and limitations of visual methods for the exploration
of complex data structures,” Ph.D. dissertation, Albert-Ludwigs-Universität
Freiburg, 2007.
[48] R. Hinze, “A simple implementation technique for priority search queues,”
in International Conference on Functional Programming, 2001, pp. 110–121.
[Online]. Available: citeseer.ist.psu.edu/hinze01simple.html
[49] N. Sarnak and R. E. Tarjan, “Planar point location using persistent search
trees,” Commun. ACM, vol. 29, no. 7, pp. 669–679, 1986.
[50] J. Boyar and K. S. Larsen, “Efficient rebalancing of chromatic search trees,”
in Proceedings of the 30th IEEE symposium on Foundations of computer
science. Orlando, FL, USA: Academic Press, Inc., 1994, pp. 667–682.
[51] G. R. Andrews and F. B. Schneider, “Concepts and notations for concurrent
programming,” ACM Comput. Surv., vol. 15, no. 1, pp. 3–43, 1983.
[52] (2008) The Java Tutorials. Lesson: Concurrency. [Online]. Available:
http://java.sun.com/docs/books/tutorial/essential/concurrency/index.html
[53] C. S. Ellis, “Concurrent search and insertion in AVL trees.” IEEE Trans.
Computers, vol. 29, no. 9, pp. 811–817, 1980.
[54] O. Nurmi and E. Soisalon-Soininen, “Chromatic binary search trees. a structure for concurrent rebalancing.” Acta Inf., vol. 33, no. 6, pp. 547–557, 1996.
[55] P. Brinch-Hansen, Operating system principles. Prentice-Hall, Inc., 1973.
[56] C. A. R. Hoare, “Monitors: an operating system structuring concept,” Commun. ACM, vol. 17, no. 10, pp. 549–557, 1974.
[57] E. G. Coffman, M. Elphick, and A. Shoshani, “System deadlocks,” ACM
Comput. Surv., vol. 3, no. 2, pp. 67–78, 1971.
[58] R. C. Holt, “Some deadlock properties of computer systems,” ACM Comput.
Surv., vol. 4, no. 3, pp. 179–196, 1972.
[59] B. Price, I. Small, and R. Baecker, “A taxonomy of software visualization,”
Proceedings of the Twenty-Fifth Hawaii International Conference on System
Sciences, vol. 2, pp. 597–606, Jan 1992.
[60] J. T. Stasko, J. B. Domingue, M. H. Brown, and B. A. Price, Eds., Software Visualization: Programming as a Multimedia Experience. The MIT Press, 1998.
[61] H. H. Goldstein and J. von Neumann, “Planning and coding problems for
an electronic computing instrument,” von Neumann Collected Works, vol. 5,
pp. 80–151, 1947.
[62] R. Fleischer and L. Kucera, “Algorithm animation for teaching,” in Revised Lectures on Software Visualization, International Seminar, ser. Lecture
Notes in Computer Science, S. Diehl, Ed., vol. 2269. Springer-Verlag, 2002,
pp. 113–128.
[63] K. Knowlton, “Bell telephone laboratories low-level linked list language,”
1966, 16-minute black and white film.
[64] (2008, July) Red Black Tree Simulation. [Online]. Available: http://reptar.uta.edu/NOTES5311/REDBLACK/RedBlack.html
[65] (2008, July) Red-Black Tree Demonstration. [Online]. Available: http://www.ece.uc.edu/~franco/C321/html/RedBlack/rb.orig.html
[66] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1994.
[67] “Java PathFinder,” 2009, http://javapathfinder.sourceforge.net/.
[68] M. Ichiriu, “High Performance Layer 3 Forwarding. The Need for Dedicated Hardware Solutions,” White Paper, NetLogic Microsystems, 2000. [Online]. Available: http://www.netlogicmicro.com/pdf/cidr_white_paper.pdf
[69] BGP Update Reports. [Online]. Available: http://bgp.potaroo.net/index-upd.html
[70] C. Villamizar, R. Chandra, and R. Govindan. (1998, Nov.) BGP
Route Flap Damping. Request for Comments: 2439. [Online]. Available:
http://www.ietf.org/rfc/rfc2439.txt
[71] G. Huston. (2006, June) The BGP Report for 2005. The ISP Column.
[Online]. Available: http://ispcolumn.isoc.org/2006-06/bgpupds.html
[72] (2008) University of Oregon Route Views Project. [Online]. Available:
http://archive.routeviews.org/bgpdata/
[73] L. Blunk, M. Karir, and C. Labovitz. MRT Format. [Online]. Available:
http://tools.ietf.org/html/draft-ietf-grow-mrt-00
[74] Routing information service (RIS). [Online]. Available: http://www.ripe.
net/ris/
[75] R. Jain and S. Routhier, “Packet trains–Measurements and a new model for
computer network traffic,” IEEE Selected Areas in Communications, vol. 4,
no. 6, pp. 986 – 995, 1986.
[76] K. C. Claffy, “Internet traffic characterization,” Ph.D. dissertation, University of California, San Diego, 1994.
[77] M. H. MacGregor and I. L. Chvets, “Locality in internetwork traffic,” University of Alberta, Tech. Rep., Mar. 2002.
[78] D. J. Lee and N. Brownlee, “Passive measurement of one-way and two-way
flow lifetimes,” ACM SIGCOMM, vol. 37, no. 3, pp. 19–27, 2007.
[79] Y. Chabchoub, C. Fricker, F. Guillemin, and P. Robert, “A study of flow
statistics of IP traffic with application to sampling,” July 2007, unpublished.
[Online]. Available: http://www-rocq.inria.fr/Philippe.Robert/src/papers/
2007-4.pdf
[80] (2008) Pareto distribution. [Online]. Available: http://en.wikipedia.org/wiki/Pareto_distribution
[81] “SUN FIRE T1000 and T2000 SERVER ARCHITECTURE,” White Paper, Sun Microsystems, Dec. 2005. [Online]. Available: http://www.sun.com/servers/coolthreads/coolthreads_architecture_wp.pdf
[82] J. M. O’Connor and M. Tremblay, “picoJava-I: The Java Virtual Machine
in Hardware,” IEEE Micro, vol. 17, no. 2, pp. 45–53, 1997.
[83] (2008) Execution of synchronized Java methods in Java computing environments. United States Patent 6918109. [Online]. Available: http://www.patentstorm.us/patents/6918109/fulltext.html
[84] (2008) Fast synchronization for programs written in the JAVA programming language. United States Patent 6349322. [Online]. Available: http://www.freepatentsonline.com/6349322.html
[85] “SUN FIRE X4600 M2 SERVER ARCHITECTURE,” White Paper, Sun Microsystems, June 2008. [Online]. Available: http://www.sun.com/servers/x64/x4600/arch-wp.pdf
[86] J. Nehmer and P. Sturm, Systemsoftware. Grundlagen moderner Betriebssysteme. dpunkt, 1998.
[87] K. Beuth and W. Schmusch, Grundschaltungen. Vogel, 1992.
[88] (2008) Field-programmable gate array. [Online]. Available: http://en.wikipedia.org/wiki/Field-programmable_gate_array
[89] (2008) FPGA Basics. [Online]. Available: http://www.andraka.com/whatisan.htm
[90] D. Hanna, A. Spagnuolo, and M. DuChene, “Speedup using flowpaths for a
finite difference solution of a 3D parabolic PDE,” Parallel and Distributed
Processing Symposium, 2007. IPDPS 2007. IEEE International, pp. 1–6,
March 2007.
[91] ajile. (2008) Embedded Low-Power Direct Execution Java Processors.
[Online]. Available: http://www.ajile.com/
[92] D. Hanna, M. DuChene, G. Tewolde, and J. Sattler, “Java flowpaths: Efficiently generating circuits for embedded systems from Java,” in International
Conference on Embedded Systems and Applications, Nov. 2006, pp. 23–30.
[93] D. M. Hanna and R. E. Haskell, “Implementing Software Programs in FPGAs using Flowpaths,” in International Conference on Embedded Systems
and Applications, 2004, pp. 76–82.
[94] M. Duchene and D. Hanna, “Implementing parallel algorithms on an FPGA
directly from multithreaded Java using flowpaths,” Circuits and Systems,
2005. 48th Midwest Symposium on, pp. 980–983 Vol. 2, Aug. 2005.
[95] C. A. Shue and M. Gupta, “Projecting IPv6 Forwarding Characteristics under Internet-wide Deployment,” in ACM SIGCOMM 2007 IPv6 Workshop,
Aug. 2007.
[96] P. Owezarski, “Does IPv6 Improve the Scalability of the Internet?” in
IDMS/PROMS 2002: Proceedings of the Joint International Workshops on
Interactive Distributed Multimedia Systems and Protocols for Multimedia
Systems. London, UK: Springer-Verlag, 2002, pp. 130–140.
[97] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in
SIGMOD Conference, B. Yormark, Ed. ACM Press, 1984, pp. 47–57.
[98] (2008) ClassBench: A packet classification benchmark. [Online]. Available: http://www.arl.wustl.edu/~det3/ClassBench/index.htm
[99] P. Gupta and N. McKeown, “Packet classification on multiple fields,” in
SIGCOMM ’99: Proceedings of the conference on Applications, technologies,
architectures, and protocols for computer communication. ACM, 1999, pp.
147–160.
[100] P. Gupta and N. McKeown, “Algorithms for packet classification,” IEEE
Network, vol. 15, no. 2, pp. 24–32, 2001.
[101] P. F. Tsuchiya, “A search algorithm for table entries with non-contiguous
wildcarding,” 1991, unpublished. [Online]. Available: citeseer.ist.psu.edu/
tsuchiya91search.html
[102] V. Srinivasan, G. Varghese, S. Suri, and M. Waldvogel, “Fast and scalable
layer four switching,” in Proceedings of SIGCOMM ’98, 1998, pp. 191–202.
[Online]. Available: citeseer.ist.psu.edu/article/srinivasan98fast.html
[103] F. Baboescu, S. Singh, and G. Varghese, “Packet classification for core routers: is there an alternative to CAMs?” in INFOCOM 2003. Twenty-Second Annual Joint Conference of the IEEE Computer and Communications Societies. IEEE, March–April 2003, pp. 53–63.
[104] Gupta and McKeown, “Packet classification using hierarchical intelligent
cuttings,” in Proceedings of Hot Interconnects VII, 1999.
[105] S. Singh, F. Baboescu, G. Varghese, and J. Wang, “Packet classification
using multidimensional cutting,” in SIGCOMM ’03: Proceedings of the
2003 conference on Applications, technologies, architectures, and protocols
for computer communications. ACM, 2003, pp. 213–224.
[106] Y. Qi and J. Li, “Towards effective packet classification,” in IASTED Communication, Network, and Information Security, 2006.
[107] F. Geraci, M. Pellegrini, P. Pisati, and L. Rizzo, “Packet classification via
improved space decomposition techniques.” in INFOCOM. IEEE, 2005, pp.
304–312.
[108] M. M. Buddhikot, S. Suri, and M. Waldvogel, “Space decomposition techniques for fast Layer-4 switching,” in Protocols for High Speed Networks IV
(Proceedings of PfHSN ’99), J. D. Touch and J. P. G. Sterbenz, Eds. Salem,
MA, USA: Kluwer Academic Publishers, Aug. 1999, pp. 25–41.
[109] H. Lim, M. Y. Kang, and C. Yim, “Two-dimensional packet classification
algorithm using a quad-tree,” Comput. Commun., vol. 30, no. 6, pp. 1396–
1405, 2007.
[110] T. Y. C. Woo, “A modular approach to packet classification: Algorithms
and results,” in INFOCOM (3), 2000, pp. 1213–1222.
[111] “Understanding ACL on Catalyst 6500 Series Switches,” White Paper, Cisco Systems, Inc. [Online]. Available: http://www.cisco.com/en/US/products/hw/switches/ps708/products_white_paper09186a00800c9470.shtml
[112] C. Solder, “Understanding Quality of Service on the Catalyst 6500 and Cisco 7600 Router,” White Paper, Cisco Systems, Inc., June 2006. [Online]. Available: http://www.cisco.com/application/pdf/en/us/guest/products/ps708/c1225/ccmigration_09186a00806eca1e.pdf
[113] “Cisco Catalyst 6500 and 6500-E Series Switch Data Sheet,” White Paper, Cisco Systems, Inc. [Online]. Available: http://www.cisco.com/en/US/prod/collateral/modules/ps2797/ps5138/product_data_sheet09186a00800ff916_ps708_Products_Data_Sheet.html
[114] R. Bayer and E. M. McCreight, “Organization and Maintenance of Large
Ordered Indices,” Acta Inf., vol. 1, pp. 173–189, 1972.
[115] Y. Manolopoulos, A. Nanopoulos, A. N. Papadopoulos, and Y. Theodoridis,
R-trees: Theory and Applications. Springer Verlag, 2005.
[116] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, “The R*-tree: An
efficient and robust access method for points and rectangles,” in SIGMOD
Conference. ACM Press, 1990, pp. 322–331.
[117] V. Gaede and O. Günther, “Multidimensional access methods,” ACM Comput. Surv., vol. 30, no. 2, pp. 170–231, 1998.
[118] T. K. Sellis, N. Roussopoulos, and C. Faloutsos, “The R-tree: A dynamic
index for multi-dimensional objects,” in The VLDB Journal, 1987, pp.
507–518. [Online]. Available: citeseer.ist.psu.edu/sellis87rtree.html
[119] P. W. Huang, P. L. Lin, and H. Y. Lin, “Optimizing storage utilization in R-tree dynamic index structure for spatial databases,” J. Syst. Softw., vol. 55, no. 3, pp. 291–299, 2001.
[120] S. Brakatsoulas, D. Pfoser, and Y. Theodoridis, “Revisiting R-tree construction principles,” in ADBIS ’02: Proceedings of the 6th East European Conference on Advances in Databases and Information Systems. London, UK:
Springer-Verlag, 2002, pp. 149–162.
[121] N. Roussopoulos and D. Leifker, “Direct spatial search on pictorial databases
using packed R-trees,” SIGMOD Rec., vol. 14, no. 4, pp. 17–31, 1985.
[122] L. Arge, M. de Berg, H. J. Haverkort, and K. Yi, “The Priority R-Tree: A
practically efficient and worst-case optimal R-tree,” in SIGMOD Conference,
G. Weikum, A. C. König, and S. Deßloch, Eds. ACM, 2004, pp. 347–358.
[123] D. E. Taylor and J. S. Turner, “ClassBench: A Packet Classification Benchmark,” in INFOCOM 2005. 24th Annual Joint Conference of the IEEE Computer and Communications Societies, 2005, pp. 2068–2079.
[124] D. Taylor and J. Turner, “ClassBench: A Packet Classification Benchmark,”
Washington University in Saint Louis, Tech. Rep., May 2004.
[125] H. Song. (2008) Packet classification evaluation. [Online]. Available:
http://www.arl.wustl.edu/~hs1/PClassEval.html
[126] (2008) The R-tree Portal. C++ and Java implementations. [Online]. Available: http://www.rtreeportal.org/index.php?option=com_content&task=view&id=17&Itemid=32
[127] D. E. Taylor, “Models, algorithms, and architectures for scalable packet
classification,” Ph.D. dissertation, Washington University, 2004.
[128] C. Kupich and K. A. Mohamed, “Conflict Detection in Internet Router Tables,” Institut für Informatik, Albert-Ludwigs-Universität Freiburg, Tech.
Rep., Aug. 2006.
[129] C. Maindorfer, K. A. Mohamed, T. Ottmann, and A. Datta, “A new output-sensitive algorithm to detect and resolve conflicts in Internet router tables,”
in INFOCOM 2007. 26th IEEE Conference on Computer Communications,
May 2007, pp. 2431–2435.
[130] E. Al-Shaer and H. Hamed, “Management and translation of filtering security policies,” ICC ’03: IEEE International Conference on Communications,
vol. 1, pp. 256–260, May 2003.
[131] H. Lu and S. Sahni, “Conflict detection and resolution in two-dimensional
prefix router tables,” IEEE/ACM Transactions on Networking, vol. 13, no. 6,
pp. 1353–1363, 2005.
[132] D. Eppstein and S. Muthukrishnan, “Internet packet filter management and
rectangle geometry,” in SODA ’01: Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia, PA, USA: Society
for Industrial and Applied Mathematics, 2001, pp. 827–835.
[133] M. H. Overmars and C.-K. Yap, “New upper bounds in Klee’s measure
problem,” SIAM J. Comput., vol. 20, no. 6, pp. 1034–1045, 1991.
[134] K. A. Mohamed and C. Maindorfer, “An O(n log n) Output-Sensitive Algorithm to Detect and Resolve Conflicts for 1D Range Filters in Router
Tables,” Institut für Informatik, Albert-Ludwigs-Universität Freiburg, Tech.
Rep., Oct. 2006.
[135] J. R. Driscoll, N. Sarnak, D. D. Sleator, and R. E. Tarjan, “Making data
structures persistent,” in STOC ’86: Proceedings of the eighteenth annual
ACM symposium on Theory of computing. New York, NY, USA: ACM
Press, 1986, pp. 109–121.
[136] K. A. Mohamed, T. Langner, and T. Ottmann, “Versioning tree structures by
path-merging,” in Frontiers in Algorithmics, ser. Lecture Notes in Computer
Science, F. P. Preparata, X. Wu, and J. Yin, Eds., vol. 5059. Springer, 2008,
pp. 101–112.
[137] C. Maindorfer, B. Bär, and T. Ottmann, “Relaxed min-augmented range
trees for the representation of dynamic IP router tables,” in 13th IEEE
Symposium on Computers and Communications, July 2008, pp. 920–927.
[138] C. Maindorfer and T. Ottmann, “Is the Popular R*-tree Suited for Packet
Classification?” in Seventh IEEE International Symposium on Network
Computing and Applications, July 2008, pp. 168–176.
[139] P. Gupta and J. Balkman. (2008) Packet Lookup and Classification
Simulator (PALAC). [Online]. Available: http://klamath.stanford.edu/
tools/PALAC/SRC/
[140] C. Maindorfer, T. Lauer, and T. Ottmann, “New data structures for IP
lookup and conflict detection,” in Algorithmics of Large and Complex Networks, ser. Lecture Notes in Computer Science. Springer, 2009, to appear.