Quantifying Design Aspects of a Platform for
Distributed Denial-of-Service Attack Mitigation
Bosko Milekic <bmilekic@FreeBSD.org>
Bram Edward Sugarman <bsugar@po-box.mcgill.ca>
Supervised by Anton L. Vinokurov <avinok@tsp.ece.mcgill.ca>
April, 2005. McGill University, Montreal, Canada.
Abstract
Distributed Denial-of-Service (DDoS) attacks are an increasingly common problem
facing Internet-exposed organizations today. One of the greatest challenges in the design
and quantitative measurement of a platform for mitigating DDoS attacks is generating, in
a real network setup, the traffic patterns and loads required to observe the performance
impact of modifications. This paper presents the methodology used to test the network
packet processing performance of the beginnings of a solution to the DDoS problem. A
design overview of a custom packet-generating solution is presented, and the packet
processing performance effects of various custom kernel options and of two modifications
to the packet processing path in a popular open-source operating system are considered.
Introduction to DDoS: Definition & Relevance
Distributed Denial-of-Service (DDoS) attacks have become increasingly prevalent
over the past five years as high-speed Internet connectivity has become widespread. A
DDoS attack is defined as follows:
Multiple attack sites overwhelming a victim site with a large quantity of traffic.
Victim’s resources are exhausted such that legitimate clients are refused service.
[Mirkovic 2003]
It is important to note from the above definition that the victim's resources include not
only the available network bandwidth toward the victim's site but also the memory
resources allocated to service the malicious traffic along the path to the attack target host
(typically an end-host), as well as the processing delays incurred throughout.
Figure 1 below shows a DDoS attack directed at a target victim located on the
public Internet within the local victim site topology shown.
Figure 1: Example of a DDoS Attack at a Multi-Homed Victim End-Host
It should be observed that the victim site is multi-homed via two connectivity providers:
Connection Provider A and Connection Provider B. Equally noteworthy is that the
depicted DDoS attack is coordinated by a single master host directing five attack drone
hosts, whereas actual DDoS attacks can and often do involve thousands, if not hundreds
of thousands, of attack hosts.
The attack depicted in Figure 1 has as its ultimate objective the prevention of
legitimate traffic flow to and from the Victim End-Host. This is achieved either by
overwhelming or exploiting resources at the End-Host itself or by exhausting all available
bandwidth on both Customer Link A and Customer Link B. Although the fact that the
victim's site is multi-homed may act in the victim's favor, a sufficiently strong attack will
exhaust at least one of the two available links, at which point traffic will overflow onto
the remaining link and exhaust it as well.
Descriptions and taxonomy of popular DDoS attacks and some mitigation
techniques can be found in [Mirkovic 2003] and [UCLATr020018].
The seriousness of DDoS attacks should not be underestimated. The attacks are
virtually unstoppable and are therefore currently the single largest threat to
Internet-exposed organizations, often capable of taking them offline for extended periods
of time.
This is evidenced by the following snippet from a front-page article in The
Financial Times:
“More than a dozen offshore gambling sites serving the US market were hit by the
so-called Distributed Denial of Service (DDoS) attacks and extortion demands in
September. Sites have been asked to pay up to $50,000 to ensure they are free
from attacks for a year. Police are urging victims not to give in to blackmail and
to report the crime.” (source: [FTIMES])
Requirements for Effectively Mitigating DDoS
A variety of techniques are currently employed in hopes of mitigating DDoS
attacks. These approaches attempt to combat attacks on two distinct fronts: the first is at
the source end, and the second is at the victim (customer) site.
One common source-end solution is the source-address filtering tactic deployed
by Internet Service Providers (ISPs) and traffic carriers. Since carriers typically operate
large routing infrastructures, they can easily examine packets before routing them onto
the public Internet. This is precisely the idea behind source-address filtering: each packet
attempting to leave the ISP's network has its source IP address examined. If the address
belongs to one of the carrier's own networks the packet is forwarded normally; however,
if the source IP address does not fall within any of the ISP's networks the packet is
dropped and never reaches its target. In the case of ordinary Denial-of-Service attacks the
attacker will often make use of IP spoofing (the falsification of an IP packet's source
address), and in such instances source-address filtering is very effective.
However, in dealing with Distributed DoS attacks this tactic is essentially useless, as IP
spoofing is rarely necessary (the attack traffic originates from otherwise legitimate
hosts, often unbeknownst to their owners). When the overwhelming amount of attack
traffic reaches the routers it is recognized as legitimate traffic and promptly forwarded on
its way toward the victim site. For this reason DDoS attacks are very difficult to detect at
the source, as evidenced by the considerable amount of active research in this area (see,
for example, [Mirkovic 2003]).
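As a concrete illustration, the per-packet source-address check described above might be sketched in user-space C as follows; the prefix list, addresses, and function names are illustrative assumptions rather than any particular carrier's implementation:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>

/* One provider-owned IPv4 prefix, e.g. 192.168.40.0/24 (illustrative values). */
struct prefix {
    uint32_t network;   /* network address, host byte order */
    uint32_t mask;      /* netmask, host byte order */
};

/* Return true if the source address falls within any of the provider's own
 * prefixes; such packets are forwarded, all others are dropped at the edge. */
static bool
source_is_internal(uint32_t src, const struct prefix *prefixes, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if ((src & prefixes[i].mask) == prefixes[i].network)
            return true;
    return false;
}

int
main(void)
{
    /* Hypothetical provider prefix list. */
    const struct prefix provider[] = {
        { 0xC0A82800, 0xFFFFFF00 },          /* 192.168.40.0/24 */
    };
    struct in_addr src;
    inet_pton(AF_INET, "10.1.2.3", &src);    /* a spoofed, external source */
    if (source_is_internal(ntohl(src.s_addr), provider, 1))
        printf("forward\n");
    else
        printf("drop\n");                    /* this example prints "drop" */
    return 0;
}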
It is understandable that, historically, most solutions have been placed at the point
where the attack has its most profound impact: the victim site. Solutions that attempt to
mitigate DDoS attacks at the victim site often involve employing sophisticated firewalls
and packet filtering techniques. Such techniques are in theory extremely effective; in
practice, however, other bottlenecks, such as processing overhead, cause the victim site to
drop legitimate traffic. Furthermore, this type of approach mostly transfers the point of
failure from the victim site itself to the point of the attempted solution. Figure 2 below
illustrates the typical topology of these types of solutions. Careful analysis shows that
this approach merely relocates the point of failure and is in fact inadequate.
Figure 2: Typical Customer-End Firewall Topology
The first downfall of the situation depicted in Figure 2 begins at the firewall itself.
As the DDoS attack rapidly delivers its massive amount of traffic, the firewall becomes
overwhelmed by the packet load. In this case the firewall itself becomes the bottleneck
leading to failure: it is unable to examine the large number of arriving packets fast
enough, causing an unmanageable build-up of traffic. As the build-up becomes unbearable
to the victim site, its resources are rapidly consumed. The incredible amount of
illegitimate traffic surged upon the victim renders the firewall essentially useless,
successfully monopolizing its resources.
The second breakdown of the type of solution modeled by Figure 2 arises along
the pipe connecting the firewall to the public Internet. In the figure, a DS3 line is depicted
representing this potential tunnel of disaster. The failure that arises in these pipes during
DDoS attacks occurs when they quickly fill due to the aggregate traffic. In this case, the
build-up propagates along the pipe all the way to the routers injecting traffic into it. The
queues of these routers rapidly fill due to their inability to send traffic along the filled
pipes, and as a consequence they begin to drop packets. Once again the DDoS attack has
succeeded, and legitimate traffic destined for the victim site is refused service.
While it would seem that the world is doomed to live in fear of DDoS attacks
forever, there is active research in the field of DDoS mitigation that is displaying
impressive potential. The solutions proposed in these projects focus on filtering in
firewalls. Remarkable distributed, learning detection algorithms are created; these
algorithms aim to discern legitimate from illegitimate traffic in real time. Research of this
variety is not only a step in the right direction, it is very important to the ultimate
creation of a DDoS mitigation solution. One of the most important attributes of such a
solution will undoubtedly be its ability to differentiate legitimate traffic from illegitimate,
malicious traffic. It is also important to remember that there are many different types of
DDoS attacks, of which very few can be handled solely by the victim. This fact
underscores the need for a distributed and possibly coordinated solution (such as
[DWARD]). Furthermore, research of this type is especially interesting from an academic
standpoint, mainly owing to the sophistication of the algorithms and data structures
produced.
One such detection and filtering solution is proposed by the D-WARD project:
D-WARD (DDoS Network Attack Recognition and Defense) is a DDoS defense system
deployed at a source-end network. Its goal is twofold:
1. Detect outgoing DDoS attacks and stop them by controlling outgoing traffic to the
victim.
2. Provide good service to legitimate transactions between the deploying network
and the victim while the attack is on-going.
([DWARD])
The D-WARD approach is much like most other source-end solutions in that it is
installed at source routers, between the attack-deploying network and the rest of the
Internet. In fact, the ultimate goal of D-WARD is to have its firewalls installed at a large
number of such gateways. Once this is in place, distributed algorithms involving many
D-WARD firewalls are able to collaborate on filter rule-sets. The theoretical aspects of
the D-WARD solution are seemingly invaluable; however, from a real-world perspective,
source-end solutions are difficult to sell. Sadly, there is little incentive for carriers to
widely deploy source-end solutions that have little immediate benefit for their own
customers (see [DDoSChain] for a proposed economic solution to this incentive-chain
problem).
Although D-WARD and other areas of academic research have shown
encouraging results, they all fail to address an important aspect of an effective DDoS
mitigation solution: packet processing. If packet processing is slow, then firewall and
filtering mitigation methods are useless: when packets arrive at the firewall faster than it
can process them, packet processing becomes the bottleneck, and all efforts toward
improving the firewall's filtering and inspection techniques are rendered practically
ineffective.
It is evident that a realistic DDoS-mitigating solution must satisfy two distinct
high-level requirements. The first is the ability to detect and discern illegitimate traffic
and subsequently install firewall rules and filters to block it. This is the type of work
being actively researched in solutions such as D-WARD. The second is the ability to
process packets quickly and to efficiently match them against a large number of rules.
The research presented in the remainder of this paper focuses directly on this second
challenge.
Challenges in Measuring Network Packet Processing Performance
It is often thought that hardware offloading should solve the packet processing
bottleneck; however, in the case of DDoS attacks, hardware offloading is not as useful as
in High-End IP Router design (see [Juniper] high-end IP router architecture). In general
ASICs and hardware offloading do not work well for firewalls, particularly those that aim
to protect against DDoS attacks. DDoS attacks can be composed of many different types
of traffic, some of which require deep packet inspection to filter, and others which can be
filtered merely with shallow network-layer header inspection. This conflicting pair of
requirements is why it is difficult in anti-DDoS firewall design to find a common case to
offload to hardware. Essentially one would need to build the features and flexibility of
the operating system and software-based firewall into the hardware to appropriately deal
with varying types of packets. This need in turn causes the hardware requirements for the
offloading unit to increase thereby invalidating the hardware-offloading design pattern.
A firewall that could successfully offload network-layer header inspection to hardware
could therefore be rendered useless by crafting an attack that cannot be filtered with
merely network-layer header inspection.
With this in mind, it is more natural to consider an off-the-shelf, server-grade
shared-memory multi-processor (SMP) system as a platform for a DDoS-mitigating
firewall, rather than a network device with custom circuitry. For this purpose, the
open-source FreeBSD operating system, which has ongoing and active development for
SMP hardware, was an obvious choice (refer to [FreeBSD] and [Lehey] for more information).
The remainder of this paper considers two fundamental challenges in packet
processing performance improvement for the beginnings of a DDoS-mitigating solution:
(1) Determining the status-quo packet processing performance of the FreeBSD
operating system on SMP hardware;
(2) Considering an experimental modification to the FreeBSD development-branch
kernel (operating system core software) and gauging its impact on
packet processing.
In order to accurately accomplish the primary goal of measuring the FreeBSD
system’s current packet processing abilities, the following was required:
(1) Set up or make use of a test network environment in which sufficient traffic
load could be generated so as to stress a target FreeBSD SMP system’s packet
processing code paths;
(2) Ensure that the target FreeBSD SMP system’s hardware is not the limiting
factor in testing;
(3) Ensure that the packet generation infrastructure (i.e., the packet source hosts)
within the test network is capable of generating sufficiently high packet rates
without running into the wire-speed limits of the test network, and that
flexible traffic patterns can be easily generated.
Requirements (1) and (2) are discussed in this section while requirement (3) is addressed
in the following section entitled Design & Implementation of a Packet Factory &
Generator Software Solution.
The test cluster used for performing packet processing testing and measurements
for the work presented herein belongs to the FreeBSD Project and is commonly referred
to as the Zoo test bed or cluster (see [ZooCluster] for more information). The current
topology of the Zoo test cluster is depicted in Figure 3.
Figure 3: Current Zoo Test Cluster Network Topology
While various machines are pictured in Figure 3, only the three identical tiger
machines were used for gathering packet processing performance data. The tiger
machines are three hardware-identical SMP servers with four gigabit Ethernet interfaces
each. On each machine, one gigabit Ethernet interface is connected to a management
switch (“MGMT switch” in Figure 3), thereby forming part of the internal management
and configuration network for all Zoo cluster machines. Of the remaining three network
interfaces on each tiger, two are used to connect directly to each of the other two
tiger hosts. All three tiger machines are therefore fully and directly interconnected in a
gigabit Ethernet loop, ensuring that no switching elements exist between any of the hosts
and eliminating the possibility that a switching element becomes the bottleneck.
As mentioned, the three tiger hosts used for testing are all SMP-capable servers
equipped with identical server-grade hardware. Each tiger machine features dual
Intel Xeon 3.06GHz Hyperthreading (HTT)-capable CPUs on an Intel Server Board
SE7501wv2, which includes three PCI buses and a 533MHz front-side bus (see [Intel] for
more information). Each machine has 4GB of RAM. It is important to note that the Intel
SE7501wv2 board has a built-in Intel 82546EB dual-port gigabit Ethernet controller and
that the machines were also equipped with a second dual-port 82546EB in the form of a
Low-Profile 64-bit PCI-X card. The latter fact is important because it ensures that each
82546EB sits on a separate, uncontested 64-bit PCI bus, thereby eliminating the
possibility of a hardware bottleneck during the high-frequency 82546EB-to-CPU and
82546EB-to-RAM transfers that result from the high packet rates to which the tiger
machines are subjected.
Figure 4: Intel Server Board SE7501wv2 Block Diagram (Source: [IntelTPS])
The bottom of Figure 4 depicts the P64H2 on the Intel SE7501wv2 board being
connected via PCI-X to the dual-port on-board 82546EB (on the left) and the Full-Profile
PCI-X slot. It also shows that the P64H2 is connected (on the right) to the Low-Profile
PCI-X slot and the SCSI Ultra320 controller found on some versions of the board. The
versions of the Intel SE7501wv2 found in the tiger machines do not include the on-board
SCSI Ultra320 controller and so, in fact, the right-side connection to the P64H2 shown in
Figure 4 only connects the Low-Profile PCI-X slot. In the tiger machines, the Full-Profile PCI-X slot is left empty while the Low-Profile slot sports the PCI-X card version
of the dual-port 82546EB. Supposing that the PCI-X buses to the P64H2 are clocked at
100MHz, and for the moment neglecting PCI-X bus acquisition overhead, the amount of
data transferable via the 64-bit wide bus per second is:
100,000,000 transfers/s × 64 bits = 6,400,000,000 bits/s ≈ 6.4 Gbps
Since the 82546EB controllers are dual-port, each requires at worst 2Gbps of bus
bandwidth (neglecting the required PCI-X bus acquisition overhead). The bus itself, being
uncontested, can provide approximately 6.4Gbps. The difference, 4.4Gbps, is more than
sufficient to absorb bus acquisition overhead on an uncontested bus, even under the most
conservative of estimates (see [PCIX] for justification). The fact that the 82546EB
controllers perform significant traffic coalescing (see [82546EB]) further ensures that not
every packet results in a bus transaction, particularly at high packet rates, thereby
lowering the neglected overhead even further.
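For reference, the headroom argument above can be summarized as follows (assuming, as in the text, a 100MHz, 64-bit PCI-X bus and neglecting bus acquisition overhead):

  bus throughput: 100MHz × 64 bits ≈ 6.4 Gbps
  worst-case dual-port 82546EB demand: 2 × 1 Gbps = 2.0 Gbps
  remaining headroom: 6.4 Gbps - 2.0 Gbps = 4.4 Gbps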
Design & Implementation of a Packet Factory & Generator Software Solution
The two fundamental high-level requirements for a custom packet-generating solution
were, first, that it allow the creation of arbitrary packets and sequences of such packets
and, second, that it provide a means of sending the constructed sequences to a specified
destination at very high rates, such that the aggregate packet rate produced by only two
sources approaches the wire-line limits of gigabit Ethernet even for small packet sizes.
Although efforts such as [Netperf], [NetPIPE], and [ttcp] offered adequate means
of generating test UDP and TCP traffic, they were not able to produce the necessary
packet load from merely two source hosts. Since the Zoo test cluster (see [ZooCluster])
offered a total of three identical tiger machines satisfying the hardware requirements, one
had to be configured as the packet sink, leaving only two to act as packet sources. Such a
limitation required a tool capable of generating much higher packet rates for small packet
sizes than user-level benchmarking tools provide, so that the only remaining bottleneck
would be the wire-line throughput limits of gigabit Ethernet.
Similarly, existing distributed DDoS-generating software such as [Trinoo] or
[TFN] (see [DDoSTools] for a more exhaustive list) was found to be conceived for use
with a large number of source hosts. Moreover, such DDoS tools were designed with
pre-conceived packet sequences hard-coded within the software, so that only very
specific attacks could be initiated. While this may be satisfactory when a very large test
network is available, it is a limited solution when only a small number of source hosts is
available, as is the case in the Zoo test cluster network.
A custom solution was developed to satisfy the two fundamental requirements
mentioned above. The developed software consists of two components: pdroned and
pdroneControl (from here on, the combined solution is referred to as
pdroned+pdroneControl).
A possible deployment diagram representing the pdroned+pdroneControl high-level
architecture is shown in Figure 5.
Figure 5: Possible Deployment Diagram For pdroned+pdroneControl
As depicted, the pdroned and pdroneControl applications are completely independent.
The pdroned application is meant to run as a user-level daemon on each source host. The
packet sink can be configured to run any set of applications or daemons, and desired
metrics can be collected. Depending on the nature of the BLAST traffic used for testing,
different metric collection methods may be used. The tests used to perform the
measurements leading to the results shown in later sections of this paper consisted of
sending a large number of small UDP packets; the desired metrics were collected directly
on the console of the sink host, whose code and configuration was varied, using the
netstat(1) utility (see [Netstat]).
The pdroned daemon is designed to listen for connections on a TCP socket. Once
a connection to the socket is established, no authentication is performed (this is to be
implemented in future versions). The pdroned daemon spawns a new thread to handle
each incoming connection. A client with an established connection can then send
XML-encoded data to perform one of three actions (multiple data blocks may be sent over a
single established connection and are processed in the order in which they are sent):
(1) Define a packet;
(2) Define a sequence of already-defined packets;
(3) Send a command that applies to an already-defined sequence of packets.
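To illustrate the connection-handling model just described (a listening TCP socket with one worker thread spawned per accepted connection), a minimal user-space sketch is shown below; it is not the actual pdroned source, and the port number, buffer handling, and helper names are illustrative assumptions:

#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Handle one client connection: read XML-encoded data blocks and process
 * them in the order in which they arrive (parsing omitted in this sketch). */
static void *
handle_client(void *arg)
{
    int fd = *(int *)arg;
    free(arg);
    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* parse_and_dispatch(buf, n): define a packet, define a sequence,
         * or execute a command such as BLAST-SEQUENCE (hypothetical helper). */
    }
    close(fd);
    return NULL;
}

int
main(void)
{
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sin;
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(9999);               /* illustrative control port */
    bind(lsock, (struct sockaddr *)&sin, sizeof(sin));
    listen(lsock, 8);
    for (;;) {
        int *fd = malloc(sizeof(*fd));
        *fd = accept(lsock, NULL, NULL);      /* note: no authentication yet */
        pthread_t tid;
        pthread_create(&tid, NULL, handle_client, fd);
        pthread_detach(tid);                  /* one thread per connection */
    }
}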
Currently, pdroned implements merely a BLAST-SEQUENCE command. The current
implementation of the BLAST-SEQUENCE command causes pdroned to output the raw
packet sequence into a file which can then be manually manipulated, sent, or loaded into
a FreeBSD Netgraph node (see [Netgraph]) called ng_source (see [ngsource]) for rapid
output blasting via a network interface.
Listing 1 shows a typical but simple XML sequence which can be sent to pdroned
resulting in the definition of a packet, followed by the definition of a packet sequence
referring to the already-defined packet. The BLAST-COMMAND XML is not shown.
<define_packet>
  <packet tag="1_example" type="ethernet">
    <header_attrib name="type" value="0x0800" />
    <header_attrib name="src_addr" value="00:0e:0c:09:ba:ff" />
    <header_attrib name="dst_addr" value="00:07:e9:1a:69:f6" />
    <payload>
      <packet tag="irrelevant_1" type="ip">
        <header_attrib name="version" value="0x4" />
        <header_attrib name="tos" value="0x00" />
        <header_attrib name="dgramlen" value="18" />
        <header_attrib name="id" value="0x01" />
        <header_attrib name="flags" value="00" />
        <header_attrib name="fragoffset" value="0x0000" />
        <header_attrib name="ttl" value="64" />
        <header_attrib name="src_addr" value="192.168.40.15" />
        <header_attrib name="dst_addr" value="192.168.40.10" />
        <header_attrib name="upperproto" value="0x11" />
        <payload>
          <packet tag="irrelevant_2" type="udp">
            <header_attrib name="src_port" value="1235" />
            <header_attrib name="dst_port" value="7" />
            <header_attrib name="length" value="10" />
            <payload>
              <packet tag="sjdkskj" type="rawdata" size="10">
                1234567890
              </packet>
            </payload>
          </packet>
        </payload>
      </packet>
    </payload>
  </packet>
</define_packet>
<define_packet_sequence>
  <packet_sequence tag="1_example_sequence">
    <packet_ref tag="1_example" count="1" />
  </packet_sequence>
</define_packet_sequence>
Listing 1: Sample XML Data Sent to pdroned Daemon
The pdroneControl software is a sample client, written in Java, to one or more
pdroned hosts. The purpose of pdroneControl is to allow a user to easily create packets
and packet sequences while maintaining control of all remote drones in an organized
fashion. The highest-level graphical interface of pdroneControl is shown in Figure 6. In
Figure 6, there are four defined drones. The columns to the right of the drone names
control all of the functionality available to each drone. On the far right of the screen, the
packet (top right) and sequence (bottom right) libraries are displayed, listing each
user-defined packet and sequence. In Figure 6, there are four defined packets and two
defined sequences.
Figure 6: pdroneControl High-Level Interface
The most essential feature of pdroneControl is its extremely simple packet
creation process. Packets can be specified in great detail, and while the work presented
herein uses only UDP packets, the interface was constructed in a way that makes it
extensible to packets of many other types. Several pull-down menus allow the user to
choose different protocols and payload types. Figure 7 shows the high-level packet
factory design. The final major high-level feature of pdroneControl is its ability to build
packet sequences. Since the blasting of packet sequences was a main concern in the
creation of a testing tool, it was essential that pdroneControl include a simple sequence
factory. Figure 8 shows the straightforward sequence factory interface.
Figure 7: pdroneControl Packet Factory GUI
Figure 8: pdroneControl Sequence Factory GUI
The low-level design of pdroneControl further exposes its extensibility.
Internally each packet is represented as a dynamic list of header objects as well as a data
object. All header types are subclasses of the Header class, which can easily be extended
to add additional not-yet-implemented header types. In keeping with implementation
simplicity, the low-level design of sequences merely encapsulates a dynamic list of
packets and a name. Figures 9 and 10 depict UML class diagrams for pdroneControl.
Figure 9: UML Class Diagram of pdroneControl Packet
Figure 10: UML Class Diagram of pdroneControl Packet-Sequence
Measuring Status Quo FreeBSD Packet Processing Performance
The previously described packet generating solution (pdroned + pdroneControl) was
used to evaluate the status quo packet processing performance of the FreeBSD kernel.
The hardware used to perform measurements was that described in Challenges in
Measuring Network Packet Processing Performance above. Specifically, the three-node
Zoo test cluster was used in the topology configuration shown in Figure 3. The tiger1
and tiger3 machines were configured as pdroned packet sources whereas the tiger2
machine acted as traffic sink and was used to test different FreeBSD installations and
kernel options.
The metrics collected on the tiger2 machine were the number of input packets
processed per second by the gigabit Ethernet controllers' interrupt service threads, the
number of gigabit Ethernet interface input errors per second flagged by the interrupt
service threads during input processing, and the number of packets output per second via
the gigabit Ethernet controllers.
The tiger1 and tiger3 packet sources were instructed via pdroned +
pdroneControl to blast small (52-byte) UDP packets to UDP port 7, with the destination
addresses corresponding to the addresses held by the tiger2 machine. Since the tiger2
machine was directly connected to the packet source machines, the packet blasting forced
tiger2 to process as many packets as possible arriving from two independent gigabit
Ethernet controllers. Additionally, tiger2 was configured to respond on UDP port 7 with
the UDP Echo service handled by the inetd daemon shipped by default with FreeBSD.
Care was taken to ensure that no rate limiting was performed by the tiger2 sink machine,
so as not to affect results.
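For reference, the 52-byte figure is consistent with the packet layering used in Listing 1, assuming it counts the Ethernet header through the UDP payload and excludes the Ethernet frame check sequence:

  14 (Ethernet) + 20 (IPv4) + 8 (UDP) + 10 (payload) = 52 bytes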
Both the bleeding-edge development version, FreeBSD 6.0-CURRENT as of April
2005 (the HEAD branch), and the legacy FreeBSD 4.10-RELEASE kernel were evaluated
with respect to the aforementioned metrics. In the case of FreeBSD 6.0-CURRENT, four
different kernels were deployed on tiger2:
(1) A GENERIC kernel (shipped by default with FreeBSD 6.0-CURRENT as of
April 5, 2005) without any debugging options. This kernel was named
VANILLA and has support for multi-processor SMP systems;
(2) The VANILLA kernel with the PREEMPTION option removed
(PREEMPTION controls full preemption in scheduling within the FreeBSD
kernel);
(3) The VANILLA kernel with the SMP option removed (Uni-Processor kernel);
(4) The VANILLA kernel with the default SCHED_4BSD scheduler replaced
with the experimental ULE scheduler available in FreeBSD 5.x and later.
Additionally, two FreeBSD 4.10-RELEASE kernels were also deployed on tiger2:
(1) A GENERIC kernel (shipped by default with FreeBSD 4.10-RELEASE).
This kernel does not have support for multi-processor SMP systems and was
used unchanged, as shipped with 4.10-RELEASE;
(2) The GENERIC kernel of 4.10-RELEASE with the SMP multi-processor
option.
For each FreeBSD 6.0-CURRENT kernel, two boot-time configuration options
(Hyperthreading and mpsafenet, discussed below) were toggled and combined to yield
four resulting data sets. Each data set was then plotted with respect to the three
aforementioned metrics for each kernel configuration described above. The results are
shown in Figures 11 through 25.
Figure 11: VANILLA kernel, Input Processing (6.0-CURRENT)
Figure 12: VANILLA kernel, Input Iface Errors (6.0-CURRENT)
Figure 13: VANILLA kernel, Output Processing (6.0-CURRENT)
Figure 14: VANILLA no PREEMPTION, Input Processing (6.0-CURRENT)
Figure 15: VANILLA no PREEMPTION, Input Iface Errors (6.0-CURRENT)
Figure 16: VANILLA no PREEMPTION, Output Processing (6.0-CURRENT)
Figure 17: VANILLA no SMP, Input Processing (6.0-CURRENT)
Figure 18: VANILLA no SMP, Input Iface Errors (6.0-CURRENT)
Figure 19: VANILLA no SMP, Output Processing (6.0-CURRENT)
Figure 20: VANILLA w/ ULE sched, Input Processing (6.0-CURRENT)
Figure 21: VANILLA w/ ULE sched, Input Iface Errors (6.0-CURRENT)
Figure 22: VANILLA w/ ULE sched, Output Processing (6.0-CURRENT)
Figure 23: GENERIC UP & SMP, Input Processing (4.10-RELEASE)
Figure 24: GENERIC UP & SMP, Input Iface Errors (4.10-RELEASE)
Figure 25: GENERIC UP & SMP, Output Processing (4.10-RELEASE)
Observation of Figures 11, 12, and 13 first reveals an inverse relationship between
the Input Packet Processing and Output Packet Processing metrics. While the
no-Hyperthreading (no HTT), mpsafenet=1 scenario performs best according to the input
processing numbers, it performs worst according to the output processing numbers. The
reason for this is likely scheduling behavior. In particular, the current default FreeBSD
scheduler, known as SCHED_4BSD, is not Hyperthreading-aware: as far as scheduling is
concerned, it does not recognize the difference between a logical and a physical processor.
This implies that in a scenario where, for example, two threads are in the “RUN” state
(i.e., require immediate use of processor execution time) and four processors are available
– of which two are physical and two are HTT logical CPUs – the FreeBSD scheduler
might in fact end up scheduling the two threads on the secondary logical processors
instead of the two real physical ones. This leads to degraded processing performance
when high-priority threads require processing time, as is the case for input packet
processing.
The processing of output packets, on the other hand, is performed by relatively
lower-priority threads originating from the user-level, inetd-launched Echo service.
Worse input processing performance (as in the Hyperthreading-enabled scenarios) implies
that less CPU time is dedicated to the higher-priority interrupt threads responsible for
input packets. This results in the lower-priority output packet processing threads
obtaining more of the CPU, and therefore increases the output processing numbers at the
expense of lower input processing.
Additionally noteworthy in Figure 12 is the relatively high number of input
interface errors observed in the Hyperthreading-enabled, mpsafenet=1 scenario. The
mpsafenet=0 setting signifies that the FreeBSD 6.0-CURRENT “Giant” mutual exclusion
lock (an in-kernel structure used to temporarily protect large portions of FreeBSD kernel
code and data until they are properly locked at a finer grain) is held over the entirety of
network code processing. Setting mpsafenet=1 causes the FreeBSD kernel to not require
the “Giant” mutual exclusion lock for network processing and to instead rely on
smaller-scope mutual exclusion locks protecting individual in-kernel data structures. The
latter is believed to bring higher parallelism on multi-processor architectures when
multiple threads are simultaneously executing the same code, as the probability of having
to block on each smaller-scope mutual exclusion lock is theoretically reduced (see
[Baldwin]). In the HTT-enabled, mpsafenet=1 scenario, the two network interface
interrupt threads are likely sub-optimally scheduled due to the scheduler's unawareness of
Hyperthreading details, thereby explaining the lower input processing performance. In the
same scenario, it also appears that there is less execution time for the FreeBSD netISR
kernel thread, a special thread responsible for pushing input packets, queued by the
high-priority network controller interrupt threads, through the TCP/IP network code. A
less frequently executing netISR thread causes the higher-priority interrupt threads to fill
their packet interface queues, thereby causing them to flag input interface errors and drop
more packets. This is indeed the situation observed in Figure 12.
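To make the mpsafenet distinction more concrete, the pattern can be pictured roughly as in the kernel-context sketch below; this is a simplified illustration rather than actual FreeBSD source, and the debug_mpsafenet variable and net_process() helper are assumed names mirroring the debug.mpsafenet tunable:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>

extern int debug_mpsafenet;        /* mirrors the debug.mpsafenet tunable (assumed name) */

static void
net_process(struct mbuf *m)        /* hypothetical helper */
{
    if (debug_mpsafenet == 0)
        mtx_lock(&Giant);          /* mpsafenet=0: the whole stack runs under Giant */

    /* ... TCP/IP stack processing of m; with mpsafenet=1 the stack relies
     * only on its own smaller-scope, per-structure locks ... */
    (void)m;

    if (debug_mpsafenet == 0)
        mtx_unlock(&Giant);
}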
The ineffectiveness of FreeBSD's scheduler with respect to a Hyperthreading-enabled
multi-processor system's ability to process packets is further revealed in Figures
14, 15, and 16, all of which depict the performance of a preemption-disabled kernel.
Particularly noteworthy is that when higher parallelism (presumably caused by
mpsafenet=1) is prevalent, and Hyperthreading is enabled, input processing performance
is worsened and input interface errors occur in higher volume. However, the side-effect
is that lower-priority packet outputting threads manage to obtain more CPU time and
thereby appear to output more packets than in all other scenarios.
Figures 17, 18, and 19 suggest that in the case of a single-processor (UP) kernel,
the effect of mpsafenet (and therefore of finer- versus coarser-grained mutual exclusion
locking) is insignificant.
Figures 20, 21, and 22 depict the influence of the experimental ULE scheduler
found in FreeBSD 6.0-CURRENT. It should be noted that only two data sets are
depicted in these figures. The data sets corresponding to the ULE scheduler scenarios
with mpsafenet=1 (and thus a higher level of parallelism) resulted in temporary and
unpredictable freezing of the receiver machine (tiger2) during testing. The exact cause of
the freezing and temporary unresponsiveness of tiger2 in these scenarios is yet to be
determined, and so the associated results are excluded from this paper. Nonetheless, it
should be noted that in the case of the ULE scheduler with mpsafenet=0 (i.e., with the
“Giant” mutual exclusion lock synchronizing access to the kernel network code and
data), performance is approximately the same regardless of whether Hyperthreading is
enabled. However, when Hyperthreading is enabled, it appears that, as with the default
FreeBSD scheduler, the netISR thread obtains less overall CPU execution time, thereby
causing input interface errors to occur occasionally.
Finally, Figures 23, 24, and 25 depict the performance of the legacy FreeBSD
4.10-RELEASE system. Particularly noteworthy is that while input processing
performance is the same in 4.10-RELEASE regardless of whether a multi-processor or
single-processor kernel is used, output processing is slightly better in the single-processor
case. The legacy FreeBSD 4.10-RELEASE kernel uses temporary interrupt masking to
ensure in-kernel synchronization, a technique that was never designed with
multi-processor in-kernel scalability as a top priority (see [Lehey]). The interrupt masking
additionally causes more priority inversion, giving rise to higher output processing
numbers at the expense of slightly lower input processing due to the temporary but
frequent masking of network interface interrupts.
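For readers unfamiliar with the 4.x style, the contrast between the two synchronization models can be sketched roughly as follows; these are simplified, illustrative fragments (one per kernel generation, not meant to be compiled together), with the protected data access elided:

/* FreeBSD 4.x style: briefly raise the interrupt priority level so that
 * network interrupts cannot preempt the critical section on this CPU. */
void
touch_shared_data_4x(void)
{
    int s = splimp();              /* mask network hardware interrupts */
    /* ... modify shared network data structures ... */
    splx(s);                       /* restore the previous level */
}

/* FreeBSD 5.x/6.x style: protect the same data with a mutual exclusion
 * lock, so that only CPUs contending for this lock are serialized. */
void
touch_shared_data_6x(struct mtx *data_lock)   /* data_lock: hypothetical mutex */
{
    mtx_lock(data_lock);
    /* ... modify shared network data structures ... */
    mtx_unlock(data_lock);
}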
Conclusions on Experimental Modifications to the FreeBSD Packet Processing Code
Paths
Two experimental changes critically impacting the FreeBSD receive packet processing
code paths were considered. The first change is referred to as the UMA critical section
optimization, whereas the second is a modification of an existing but highly experimental
mechanism in FreeBSD 6.0-CURRENT involving the net.isr.enable run-time tunable
system control variable. The first of the considered changes was successfully
benchmarked with respect to the three metrics defined in the Measuring Status Quo
FreeBSD Packet Processing Performance section of this paper. The second change
causes a live-lock phenomenon to occur on the receiver (sink) host, tiger2, and so
performance data was not successfully collected. Work is ongoing to eliminate the
live-lock condition, and some details are presented below.
In order to understand the potential impact of the UMA critical section
optimization, some background knowledge is required. The UMA subsystem is a general
multi-processor-optimized memory allocator derived from the classic Slab memory
allocator design (see [Bonwick]). The UMA allocator is found in FreeBSD 6.0-CURRENT
and features small, higher-layer per-CPU caches in order to provide a fast path
for common-case allocations. Simply put, if a memory buffer can be found in the
executing CPU's private UMA cache at allocation time, it is immediately allocated.
Otherwise, a global memory buffer cache needs to be checked, followed by a call into
the lower-level Virtual Memory (VM) subsystem code should the global cache also be
found empty.
In the current FreeBSD 6.0-CURRENT implementation, per-CPU mutual
exclusion locks protect the per-CPU UMA caches. This has the benefit of ensuring
synchronized access to the per-CPU UMA cache should preemption occur at the time of
access. The unfortunate side-effect of using per-CPU mutual exclusion locks is their
relatively poor performance, even in the common case where the lock is uncontested (see
[McKenney] for a general alternative to mutual exclusion applicable to particular
scenarios for this very reason). Currently, all network buffers (as well as most other
kernel data structures and buffers) in FreeBSD 6.0-CURRENT are allocated using the
UMA allocator.
The UMA critical section optimization replaces the UMA per-CPU cache mutual
exclusion locks with a micro-optimized interrupt masking mechanism combined with
temporary CPU thread pin-down (asking the kernel scheduler not to migrate the currently
executing thread away from the current CPU should preemption occur). The argument
for the change is that optimized interrupt masking and temporary CPU thread pin-down
are significantly cheaper operations than mutual exclusion lock acquisitions in the
common case (refer to [Hlusi] for a discussion of the relative cost of SMP-related
operations on contemporary Intel hardware).
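The shape of the change can be sketched as follows; this is a simplified, kernel-context illustration rather than the actual UMA patch (struct percpu_cache and cache_fetch() are hypothetical, while mtx_lock()/mtx_unlock(), critical_enter()/critical_exit(), and sched_pin()/sched_unpin() are the FreeBSD primitives for mutexes, temporary preemption disabling, and thread pin-down, respectively):

/* Simplified sketch of the UMA per-CPU cache fast path (not actual source). */
static void *
uma_cache_alloc_sketch(struct percpu_cache *cache)
{
    void *item;

#ifdef UMA_PERCPU_MUTEX
    /* Before the optimization: each per-CPU cache is protected by its own
     * mutual exclusion lock, whose cost is paid even when uncontested. */
    mtx_lock(&cache->lock);
    item = cache_fetch(cache);     /* take a buffer from this CPU's cache */
    mtx_unlock(&cache->lock);
#else
    /* After the optimization: pin the thread to its current CPU and enter
     * a critical section; preemption (and hence migration, or interruption
     * by another thread on this CPU) is deferred for the brief cache access. */
    sched_pin();
    critical_enter();
    item = cache_fetch(cache);
    critical_exit();
    sched_unpin();
#endif
    return (item);
}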
Figures 26, 27, and 28 depict the performance of the UMA critical section
optimization change and the VANILLA 6.0-CURRENT kernel with respect to the three
previously defined metrics. As depicted, input processing appears to be more regular
(smoother curve) with the UMA critical section optimization than without, although input
performance is slightly worse. On the other hand, the UMA critical section optimization
shows significantly fewer input interface errors as well as significantly higher (500% on
average) output processing performance. Due to the high sensitivity of the recorded
metrics to slight nuances in scheduling behavior and the nature of the optimizations, the
results are in fact not surprising.
The UMA critical section optimization makes use of temporary preemption
disabling to ensure that an incoming interrupt does not corrupt the UMA per-CPU cache
during cache access. Although the argument for the change is that the net common-case
cost of an optimized temporary preemption disable accompanied by a temporary CPU
thread pin-down is lower than the common-case cost of a mutual exclusion lock
acquisition, the effect reported by the metrics in Figures 26, 27, and 28 is actually due to
the change's impact on scheduling behavior. The temporary preemption disable consists
of interrupt masking, thereby causing a short-term priority inversion in favor of a
low-priority thread. This is in fact what happens during the tests. The receiver (sink) is
subjected to a very high interrupt load: the more interrupts the sink is able to process, the
higher the input processing numbers will be, but so will the input interface errors, while
output processing numbers will be lower. The reason for this inverse relationship is that
heavier interrupt processing starves out available CPU time, preventing the CPUs from
running any lower-priority threads. In the case of the UMA critical section optimization,
the temporary priority inversions prevent the high number of incoming interrupts from
monopolizing the processor, causing input performance to drop but output performance
and the netISR thread's performance to increase (the latter explaining the lower number
of input interface errors).
Figure 26
Figure 27
Figure 28
Which behavior is preferable for the purposes of mitigating DDoS depends on the
overall design of the filter and firewall software. Should the filter or firewall code paths
run from the netISR thread or from user-level code, the UMA critical section optimization
is favorable. Should the filtering code paths execute solely from low-level interrupt
handlers in the firewall's design, then any change increasing priority inversions should
generally be avoided.
The second experimental change to packet processing code paths in FreeBSD
involved a modification of the existing FreeBSD 6.0-CURRENT net.isr.enable system
control tunable. The default setting for the net.isr.enable tunable is zero. The default
behavior for the FreeBSD network interface receiver path is shown in Figure 29.
Figure 29: FreeBSD Receiver Packet Processing Paths
As depicted, the default behavior of the FreeBSD 6.0-CURRENT kernel is to add a
packet requiring stack processing to an appropriate interface queue and to schedule the
netISR kernel thread to perform the additional processing. The netISR thread is typically
launched via a context switch from the kernel scheduler should it be the next
highest-priority thread on the run-queue.
When the net.isr.enable tunable is set to 1, the netISR processing is dispatched
directly from the network interface's receiver interrupt thread, so that an immediate return
to the scheduler is not performed. An attempt was made to measure the performance
effects of this behavior with respect to the three defined metrics, but it failed. Setting
net.isr.enable to 1 in conjunction with the high packet rates and the associated high
interrupt load causes complete live-lock on the receiver (sink) system: the now
longer-executing interrupt threads completely starve out all available processor time and
prevent any lower-priority threads from executing.
Further attempts must be made to properly gauge the resulting input packet
processing numbers and the benefits that net.isr.enable=1 behavior could bring to a
DDoS-mitigating solution. However, the extended analysis is beyond the scope of this
paper.
Acknowledgments
The authors would like to thank their supervisor, Anton Vinokurov, for providing space
and accommodation for TESLA, a FreeBSD 6.0-CURRENT build machine used for
pdroned+pdroneControl development and testing, as well as for various preliminary
kernel testing. Anton's experience and advice were invaluable in the development and
realization of this project.
The authors would also like to thank the FreeBSD Project and in particular Robert
Watson and George Neville-Neil for their help and advice regarding the FreeBSD Zoo
Test Cluster used for various performance measurements, the results of which are
included in this paper. Also deserving a mention are all the donors of the FreeBSD
Project Zoo Test Cluster hardware (see [ZooCluster]). Robert Watson is also responsible
for the UMA critical section patches and their continued maintenance.
Last but not least, the Sandvine (http://www.sandvine.com/) engineers merit a
thank you for their help with developing and maintaining the ng_source FreeBSD kernel
module in a usable state.
Data & Source Code
The source code to pdroned+pdroneControl used to generate the data in this paper as
well as the raw data and an electronic (PDF) copy of this paper can be found at the
following HTTP address: http://people.freebsd.org/~bmilekic/quantifying_ddos/
References
[Mirkovic 2003] J. Mirkovic, G. Prier and P. Reiher, Challenges of Source-End DDoS Defense,
Proceedings of 2nd IEEE International Symposium on Network Computing and Applications, April 2003.
[UCLATr020018] J. Mirkovic, J. Martin and P. Reiher, A Taxonomy of DDoS Attacks and DDoS Defense
Mechanisms, UCLA CSD Technical Report CSD-TR-020018.
[DWARD] J. Mirkovic, D-WARD: Source-End Defense Against Distributed Denial-of-Service Attacks,
Ph.D. Thesis, UCLA, August 2003
[FreeBSD] [WEB] The FreeBSD Project website. URL: http://www.FreeBSD.org/
[Netperf] [WEB] The Public Netperf Performance Tool website.
URL: http://www.netperf.org/netperf/NetperfPage.html
[NetPIPE] [WEB] A Network Protocol Independent Performance Evaluator: NetPIPE Project website.
URL: http://www.scl.ameslab.gov/netpipe/
[ttcp] [WEB] PCAUSA Test TCP Benchmarking Tool for Measuring TCP and UDP Performance (Project
website). URL: http://www.pcausa.com/Utilities/pcattcp.htm
[ZooCluster] [WEB] The Zoo Test Cluster News and Updates website. URL: http://zoo.unixdaemons.com/
[Trinoo] [WEB] David Dittrich. The DoS Project’s “Trinoo” Distributed Denial of Service Attack Tool.
University of Washington public document. October 1999.
URL: http://staff.washington.edu/dittrich/misc/trinoo.analysis.txt
[TFN] [WEB] David Dittrich. The “Tribe Flood Network” Distributed Denial of Service Attack Tool.
University of Washington public document. October, 1999.
URL: http://staff.washington.edu/dittrich/misc/tfn.analysis.txt
[DDoSTools] [WEB] Advanced Networking Management Lab Distributed Denial of Service Attack Tools.
Public document from Pervasive Technology Labs at Indiana University.
URL: http://www.anml.iu.edu/ddos/tools.html
[Netstat] [WEB] The netstat(1) FreeBSD Manual Page (various authors). Available online.
URL: http://snowhite.cis.uoguelph.ca/course_info/27420/netstat.html
[Baldwin] John H. Baldwin, Locking in the Multithreaded FreeBSD Kernel, Proceedings of the USENIX
BSDCon 2002 Conference, February, 2002.
[DI-44BSD] Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, John S. Quarterman, The Design
and Implementation of the 4.4BSD Operating System, Addison-Wesley Longman, Inc, 1996.
[DI-FreeBSD] Marshall Kirk McKusick, George Neville-Neil, The Design and Implementation of the
FreeBSD Operating System, Addison-Wesley Professional, 2004.
[DDoSChain] Yun Huang, Xianjun Geng, Andrew B. Whinston, Defeating DDoS Attacks by Fixing the
Incentive Chain, University of Texas public document, April 2003.
URL: http://crec.mccombs.utexas.edu/works/articles/DDoS2005.pdf
[FTIMES] Chris Nuttall. The Financial Times. November 12, 2003. Front Page, First Section.
[Intel] [WEB] The Intel Public Website. URL: http://www.intel.com/
[IntelTPS] Various authors. The Intel® SE7501WV2 server board Technical Product Specification (TPS).
Intel Public Documentation. December 2003.
URL: ftp://download.intel.com/support/motherboards/server/se7501wv2/tps.pdf
[Netgraph] [WEB] The netgraph(4) FreeBSD Manual Page (various authors). Available online.
URL: http://www.elischer.org/netgraph/man/netgraph.4.html
[PCIX] [WEB] Various authors. PCI-X 2.0: High-Performance, Backward-Compatible PCI for the Future.
URL: http://www.pcisig.com/specifications/pcix_20/
[ngsource] [WEB] The ng_source(4) FreeBSD Manual Page (various authors). Available online.
URL: http://www.freebsd.org/cgi/man.cgi?query=ng_source&apropos=0&sektion=0&manpath=FreeBSD+6.0-current&format=html
[Bonwick] Jeff Bonwick, Jonathan Adams. Magazines and Vmem: Extending the Slab Allocator to Many
CPUs and Arbitrary Resources. Proceedings of the USENIX 2001 Annual Technical Conference. June
2001.
[McKenney] Paul E. McKenney, Kernel Korner – Using RCU in the Linux 2.5 Kernel. Linux Journal,
September 2003. Available online.
URL: http://linuxjournal.com/article/6993
[McKenney2] Paul E. McKenney, Read-Copy-Update: Using Execution History to Solve Concurrency
Problems, Unpublished.
URL: http://www.rdrop.com/users/paulmck/rclock/intro/rclock_intro.html
[Lehey] Greg Lehey, Improving the FreeBSD SMP Implementation, Proceedings of the USENIX 2001
Annual Technical Conference: Freenix Track, 2001.
[Hlusi] Jiri Hlusi, Symmetric Multi-Processing (SMP) Systems On Top of Contemporary Intel Appliances,
University of Tampere, Department of Computer and Information Sciences, Pro gradu. December 2002.
[Juniper] [WEB] Juniper Networks public website. URL: http://www.juniper.net/
[82546EB] [WEB] Various authors. Intel 82546EB Gigabit Ethernet Controller Specification Update.
Available online. URL: http://www.intel.com/design/network/specupdt/82546eb.htm