Quantifying Design Aspects of a Platform for Distributed Denial-of-Service Attack Mitigation

Bosko Milekic <bmilekic@FreeBSD.org>
Bram Edward Sugarman <bsugar@po-box.mcgill.ca>
Supervised by Anton L. Vinokurov <avinok@tsp.ece.mcgill.ca>
April, 2005. McGill University, Montreal, Canada.

Abstract

Distributed Denial-of-Service (DDoS) attacks are an increasingly common problem facing Internet-exposed organizations today. One of the greatest challenges in the design and quantitative measurement of a platform for mitigating DDoS attacks is the ability to generate the traffic patterns and loads required in a real network setup so that the performance impact of modifications can be observed. This paper presents the methodology used to test the network packet processing performance of the beginnings of a solution to the DDoS problem. A design overview of a custom packet generating solution is presented. The packet processing performance effects of various custom kernel options and of two modifications to the packet processing path in a popular open-source operating system are considered.

Introduction to DDoS: Definition & Relevance

Distributed Denial-of-Service (DDoS) attacks have become increasingly prevalent throughout the past five years as high-speed Internet connectivity has become widespread. A DDoS attack is defined as follows:

Multiple attack sites overwhelming a victim site with a large quantity of traffic. Victim's resources are exhausted such that legitimate clients are refused service. [Mirkovic 2003]

It is important to note from the above definition that the victim's resources include not only the available network bandwidth toward the victim's site but also the memory resources allocated to service the malicious traffic along the path to the attack target host (typically an end-host) and the processing delays caused throughout.

Figure 1 below shows a DDoS attack directed at a target victim on the public Internet, within the local victim site topology shown.

Figure 1: Example of a DDoS Attack at a Multi-Homed Victim End-Host

It should be observed that the victim site is multi-homed by two connectivity providers: Connection Provider A and Connection Provider B. Equally noteworthy is that the depicted DDoS attack is coordinated by a single master host and five attack drone hosts, but actual DDoS attacks can and often do involve thousands if not hundreds of thousands of attack hosts. The attack depicted in Figure 1 has as its ultimate objective the prevention of legitimate traffic flow to and from the Victim End-Host. Such a situation is achieved either by overwhelming or exploiting resources at the End-Host itself or by exhausting all available bandwidth in both Customer Link A and Customer Link B. Although the fact that the victim's site is multi-homed may act in the victim's favor, a sufficiently strong attack will lead to at least one of the two available links becoming exhausted, at which point traffic will overflow to the remaining link and exhaust it as well. Descriptions and a taxonomy of popular DDoS attacks and some mitigation techniques can be found in [Mirkovic 2003] and [UCLATr020018].

The seriousness of DDoS attacks should not be underestimated. The attacks are virtually unstoppable and are therefore currently the single largest threat to Internet-exposed organizations, often capable of bringing them offline for extended periods of time.
This is evidenced by the following snippet from a front-page article in The Financial Times: "More than a dozen offshore gambling sites serving the US market were hit by the so-called Distributed Denial of Service (DDoS) attacks and extortion demands in September. Sites have been asked to pay up to $50,000 to ensure they are free from attacks for a year. Police are urging victims not to give in to blackmail and to report the crime." (source: [FTIMES])

Requirements for Effectively Mitigating DDoS

Currently a variety of techniques are being employed in hopes of mitigating DDoS attacks. There are two distinct fronts on which these approaches attempt to combat the attack: the first is at the source end, and the second is at the victim (customer) site.

One common source-end solution is source-address filtering (referred to in some discussions as source routing), deployed by Internet Service Providers (ISPs) and traffic carriers. Since carriers typically have a large routing setup, they are able to easily examine packets before routing them to the public Internet. This is precisely the idea behind the technique: each packet attempting to leave the ISP's network has its source IP address examined. If the address is deemed to have originated from one of the carrier's own networks, the packet is forwarded normally; however, if the source IP address is determined not to be from within any of the ISP's networks, the packet is dropped and never reaches its target. In the case of ordinary Denial-of-Service attacks the attacker will often make use of IP spoofing: the falsification of an IP packet's source address. In such instances source-address filtering is very effective. However, in dealing with Distributed DoS attacks this tactic is essentially useless, as IP spoofing is rarely necessary (the attack traffic originates from otherwise legitimate hosts, often unbeknownst to their owners). When the overwhelming amount of attack traffic reaches the routers it is recognized as legitimate traffic and promptly forwarded on its way toward the victim site. For this reason the detection of DDoS attacks is very difficult at the source. The difficulty of proper source-end detection is evidenced by much active research in this area (see for example [Mirkovic 2003]).

It is understandable that historically most solutions are placed at the point where the attack has its most profound impact: the victim site. Solutions that attempt to mitigate DDoS attacks at the victim site often involve employing sophisticated firewalls and packet filtering techniques. Such techniques are in theory extremely effective; however, in practice other bottlenecks, such as processing overhead, cause the victim site to drop legitimate traffic. Furthermore, this type of approach mostly transfers the point of failure from the victim site itself to the point of the attempted solution. Figure 2 below illustrates the typical topology of these types of solutions. Careful analysis reveals that this approach indeed suffers from this transfer of the failure point and is in fact inadequate.

Figure 2: Typical Customer-End Firewall Topology

The first downfall of the situation depicted in Figure 2 begins at the firewall itself. As the DDoS attack rapidly propagates its massive amount of traffic, the firewall becomes overwhelmed by the packet load. In this case the firewall itself becomes the bottleneck leading to failure: the firewall is unable to examine the large number of arriving packets fast enough, causing an unmanageable build-up of traffic.
As the build-up becomes unbearable to the victim site, its resources are rapidly consumed. The incredible amount of illegitimate traffic surged upon the victim has rendered the firewall essentially useless, successfully monopolizing its resources.

The second breakdown of the type of solution modeled by Figure 2 arises along the pipe connecting the firewall to the public Internet. In the figure a DS3 line is depicted, representing this potential tunnel of disaster. The impending failure that arises in these pipes during DDoS attacks occurs when they quickly fill due to aggregate traffic. In this case, the build-up propagates along the pipe all the way to the routers injecting traffic into it. The queues of these routers rapidly fill due to their inability to send traffic along the filled pipes, and as a consequence they begin to drop packets. Once again the DDoS attack has succeeded, and legitimate traffic destined for the victim site is refused service.

While it would seem that the world is doomed to live in fear of DDoS attacks forever, there is in fact active research in the field of DDoS mitigation that is showing impressive potential. The solutions proposed in these projects focus on filtering in firewalls. Remarkable distributed and learning detection algorithms are created; these algorithms aim to discern legitimate from illegitimate traffic in real time. Research of this variety is not only a step in the right direction, it is very important to the ultimate creation of a DDoS mitigation solution. One of the most important attributes of such a solution will undoubtedly be its ability to differentiate legitimate traffic from illegitimate malicious traffic. It is also important to remember that there are many different types of DDoS attacks, of which very few can be handled solely by the victim. This fact underlines the need for a distributed and possibly coordinated solution (such as [DWARD]). Furthermore, research of this type is especially interesting from an academic standpoint, mainly owing to the sophisticated algorithms and database structures it produces.

One such detection and filtering solution is proposed by the D-WARD project:

D-WARD (DDoS Network Attack Recognition and Defense) is a DDoS defense system deployed at a source-end network. Its goal is twofold: 1. Detect outgoing DDoS attacks and stop them by controlling outgoing traffic to the victim. 2. Provide good service to legitimate transactions between the deploying network and the victim while the attack is on-going. ([DWARD])

The D-WARD approach is much like most other source-end solutions in that it is installed at source routers, between the attack-deploying network and the rest of the Internet. In fact, the ultimate goal of D-WARD is to have its firewalls installed at a large number of such gateways. Once this is achieved, distributed algorithms involving many D-WARD firewalls are able to collaborate on filter rule-sets. The theoretical aspects of the D-WARD solution are seemingly invaluable; however, from a real-world perspective source-end solutions are difficult to sell. Sadly, there is little incentive for carriers to widely deploy source-end solutions that have little immediate benefit for their own customers (see [DDoSChain] for a proposed economic solution to this incentive chain problem).
Although D-WARD and other areas of academic research have shown encouraging results, they all fail to address an important aspect of an effective DDoS mitigation solution: packet processing. If packet processing is slow, then firewall and filtering mitigation methods are useless. This is mainly because if packets are being passed to the firewall at a rate higher than the firewall is able to process them, packet processing becomes the bottleneck; in this case all efforts towards improving the firewall's filtering and inspection techniques are rendered practically useless.

It is evident that a realistic DDoS-mitigating solution has two distinct high-level requirements. The first is the ability to detect and discern illegitimate traffic and subsequently install firewall rules/filters to block it. This is the type of work being actively researched in solutions such as D-WARD. The second is the ability to process packets quickly and efficiently match them against a large number of rules. The research presented in the remainder of this paper focuses directly on this second challenge.

Challenges in Measuring Network Packet Processing Performance

It is often thought that hardware offloading should solve the packet processing bottleneck; however, in the case of DDoS attacks, hardware offloading is not as useful as in high-end IP router design (see [Juniper] for a high-end IP router architecture). In general, ASICs and hardware offloading do not work well for firewalls, particularly those that aim to protect against DDoS attacks. DDoS attacks can be composed of many different types of traffic, some of which require deep packet inspection to filter, and others which can be filtered merely with shallow network-layer header inspection. This conflicting pair of requirements is why it is difficult in anti-DDoS firewall design to find a common case to offload to hardware. Essentially, one would need to build the features and flexibility of the operating system and software-based firewall into the hardware to appropriately deal with varying types of packets. This in turn causes the hardware requirements for the offloading unit to increase, thereby invalidating the hardware-offloading design pattern. A firewall that could successfully offload network-layer header inspection to hardware could therefore be rendered useless by crafting an attack that cannot be filtered with network-layer header inspection alone.

With this in mind, it is more natural to consider an off-the-shelf server-grade shared-memory multi-processor (SMP) system as a platform for a DDoS-mitigating firewall, rather than a network device with custom circuitry. For this purpose, the open-source FreeBSD operating system, which has ongoing and active development for SMP hardware, was an obvious choice (refer to [FreeBSD] and [Lehey] for more information). The remainder of this paper considers two fundamental challenges in packet processing performance improvement for the beginnings of a DDoS-mitigating solution:

(1) Determining the status-quo packet processing performance of the FreeBSD operating system on SMP hardware;

(2) Considering an experimental modification to the FreeBSD development-branch kernel (operating system core software) and gauging its impact on packet processing.
In order to accurately accomplish the primary goal of measuring the FreeBSD system's current packet processing abilities, the following was required:

(1) Set up or make use of a test network environment in which sufficient traffic load could be generated so as to stress a target FreeBSD SMP system's packet processing code paths;

(2) Ensure that the target FreeBSD SMP system's hardware is not the limiting factor in testing;

(3) Ensure that the packet generation infrastructure (i.e., the packet source hosts) within the test network is capable of generating sufficiently high packet rates without bottlenecking below the wire-speed limits of the test network, and that flexible traffic patterns can be easily generated.

Requirements (1) and (2) are discussed in this section, while requirement (3) is addressed in the following section, entitled Design & Implementation of a Packet Factory & Generator Software Solution.

The test cluster used for the packet processing testing and measurements presented herein belongs to the FreeBSD Project and is commonly referred to as the Zoo test bed or cluster (see [ZooCluster] for more information). The current topology of the Zoo test cluster is depicted in Figure 3.

Figure 3: Current Zoo Test Cluster Network Topology

While various different machines are pictured in Figure 3, only the three identical tiger machines were used for gathering packet processing performance data. The tiger machines consist of three hardware-identical SMP servers with four gigabit Ethernet interfaces each. On each machine, one gigabit Ethernet interface is connected to a management switch ("MGMT switch" in Figure 3), thereby contributing to the internal management and configuration network for all Zoo cluster machines. Of the remaining three network interfaces on each tiger, two are used to connect directly to each of the remaining two tiger hosts. Therefore, all three tiger machines are directly and fully interconnected, forming a gigabit Ethernet loop, ensuring that no switching elements exist between any hosts and eliminating the possibility that a switching element becomes the bottleneck.

As mentioned, the three tiger hosts used for testing are all SMP-capable servers and are equipped with identical server-grade hardware. Each tiger machine features dual Intel Xeon 3.06GHz Hyperthreading (HTT)-capable CPUs on an Intel Server Board SE7501WV2, which includes three PCI buses and a 533MHz front-side bus (see [Intel] for more information). Each machine has 4GB of RAM. It is important to note that the Intel SE7501WV2 board has a built-in Intel 82546EB dual-port gigabit Ethernet controller and that the machines were also equipped with a second dual-port 82546EB in the form of a Low-Profile 64-bit PCI-X card. The latter fact is important because it ensures that each 82546EB sits on a separate, uncontested 64-bit PCI bus, eliminating the possibility of bottlenecking in hardware during the high-frequency 82546EB-to-CPU or RAM transfers that occur as a result of the high packet rates to which the tiger machines are subjected.

Figure 4: Intel Server Board SE7501WV2 Block Diagram (Source: [IntelTPS])

The bottom of Figure 4 depicts the P64H2 on the Intel SE7501WV2 board being connected via PCI-X to the dual-port on-board 82546EB (on the left) and the Full-Profile PCI-X slot. It also shows that the P64H2 is connected (on the right) to the Low-Profile PCI-X slot and the SCSI Ultra320 controller found on some versions of the board.
The versions of the Intel SE7501WV2 found in the tiger machines do not include the on-board SCSI Ultra320 controller, so in fact the right-side connection to the P64H2 shown in Figure 4 only connects the Low-Profile PCI-X slot. In the tiger machines, the Full-Profile PCI-X slot is left empty while the Low-Profile slot holds the PCI-X card version of the dual-port 82546EB. Supposing that the PCI-X buses to the P64H2 are clocked at 100MHz, and for the moment neglecting PCI-X bus acquisition overhead, the amount of data transferable via the 64-bit-wide bus per second is:

100,000,000 transfers/s × 64 bits = 6,400,000,000 bits/s ≈ 6.4 Gbps

Since the 82546EB controllers are dual-port, they require at worst 2 Gbit/s of bus bandwidth (neglecting required PCI-X bus acquisition overhead). The bus itself, since uncontested, can provide approximately 6.4 Gbps. The difference, 4.4 Gbps, is more than sufficient to absorb bus acquisition overhead when the bus is uncontested, even under the most conservative of estimates (see [PCIX] for justification). The fact that the 82546EB controllers perform significant traffic coalescing (see [82546EB]) further ensures that not every packet results in a bus transaction, particularly at high packet rates, thereby lowering the associated (and neglected) overhead.

Design & Implementation of a Packet Factory & Generator Software Solution

The two fundamental high-level requirements for a custom packet generating solution were, first, that it allow for the creation of arbitrary packets and sequences of said packets and, second, that it provide a means to send constructed sequences to a specified destination at very high rates, such that the aggregate packet rate produced by only two sources approached the wire-line limits of gigabit Ethernet even for small packet sizes. Although efforts such as [Netperf], [NetPIPE], and [ttcp] offered adequate means for generating test UDP and TCP traffic, they were not able to produce the necessary packet load from merely two source hosts. Since the Zoo test cluster (see [ZooCluster]) currently offered a total of three identical tiger machines satisfying the hardware requirements, one had to be configured as the packet sink, leaving only two to act as packet sources. Such a limitation required the use of a tool capable of generating much higher packet rates for small packet sizes than those provided by user-level benchmarking tools, so that the bottleneck would be the wire-line throughput limit of gigabit Ethernet rather than the tool itself.

Similarly, existing DDoS-generating distributed software such as [Trinoo] or [TFN] (see [DDoSTools] for a more exhaustive list) was found to be conceived for use with a large number of source hosts. Moreover, such DDoS tools were designed with preconceived packet sequences hard-coded within the software, such that only very specific attacks could be initiated. While this may be satisfactory when a very large test network is available, it is a limited solution when only a small number of source hosts is available, as is the case in the Zoo test cluster network.

A custom solution was developed to satisfy the two fundamental requirements mentioned above. The developed software consists of two components: pdroned and pdroneControl (from here on, the combined solution shall be referred to as pdroned+pdroneControl). A possible deployment diagram representing the pdroned+pdroneControl high-level architecture is shown in Figure 5.
Figure 5: Possible Deployment Diagram For pdroned+pdroneControl

As depicted, the pdroned and pdroneControl applications are completely independent. The pdroned application is meant to run as a user-level daemon on each source host. The packet sink can be configured to run any set of applications or daemons, and desired metrics can be collected. Depending on the nature of the BLAST traffic used for testing, different metric collection methods may be used. The tests used to perform the measurements leading to the results shown in later sections of this paper consisted of sending a large number of small UDP packets; the desired metrics were collected directly from the console of the sink host, on which code and configuration were varied, using the netstat(1) utility (see [Netstat]).

The pdroned daemon is designed to listen for connections on a TCP socket. Once a connection to the socket is established, no authentication is performed (this is to be implemented in future versions). The pdroned daemon spawns a new thread to handle each incoming connection. A client with an established connection can then send XML-encoded data to perform one of three actions (multiple data blocks may be sent over a single established connection and are processed in the order in which they are sent):

(1) Define a packet;

(2) Define a sequence of already-defined packets;

(3) Send a command that applies to an already-defined sequence of packets.

Currently, pdroned implements only a BLAST-SEQUENCE command. The current implementation of the BLAST-SEQUENCE command causes pdroned to output the raw packet sequence into a file which can then be manually manipulated, sent, or loaded into a FreeBSD Netgraph node (see [Netgraph]) called ng_source (see [ngsource]) for rapid output blasting via a network interface.

Listing 1 shows a typical but simple XML sequence which can be sent to pdroned, resulting in the definition of a packet followed by the definition of a packet sequence referring to the already-defined packet. The BLAST-COMMAND XML is not shown.

<define_packet>
  <packet tag="1_example" type="ethernet">
    <header_attrib name="type" value="0x0800" />
    <header_attrib name="src_addr" value="00:0e:0c:09:ba:ff" />
    <header_attrib name="dst_addr" value="00:07:e9:1a:69:f6" />
    <payload>
      <packet tag="irrelevant_1" type="ip">
        <header_attrib name="version" value="0x4" />
        <header_attrib name="tos" value="0x00" />
        <header_attrib name="dgramlen" value="18" />
        <header_attrib name="id" value="0x01" />
        <header_attrib name="flags" value="00" />
        <header_attrib name="fragoffset" value="0x0000" />
        <header_attrib name="ttl" value="64" />
        <header_attrib name="src_addr" value="192.168.40.15" />
        <header_attrib name="dst_addr" value="192.168.40.10" />
        <header_attrib name="upperproto" value="0x11" />
        <payload>
          <packet tag="irrelevant_2" type="udp">
            <header_attrib name="src_port" value="1235" />
            <header_attrib name="dst_port" value="7" />
            <header_attrib name="length" value="10" />
            <payload>
              <packet tag="sjdkskj" type="rawdata" size="10">
                1234567890
              </packet>
            </payload>
          </packet>
        </payload>
      </packet>
    </payload>
  </packet>
</define_packet>

<define_packet_sequence>
  <packet_sequence tag="1_example_sequence">
    <packet_ref tag="1_example" count="1" />
  </packet_sequence>
</define_packet_sequence>

Listing 1: Sample XML Data Sent to pdroned Daemon

The pdroneControl software is a sample client to one or more pdroned hosts, written in Java.
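A drone client need not be graphical. As a hedged illustration of the control protocol just described, the following C sketch connects to a pdroned host and pushes a pre-built XML block over the TCP control connection. The port number, the drone address, and the abbreviated XML buffer are placeholders rather than values taken from the actual pdroned sources.

/*
 * Minimal, illustrative pdroned client sketch (not part of the actual
 * pdroned/pdroneControl sources).  It assumes pdroned accepts raw XML
 * blocks on its TCP control socket, as described above.
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <err.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        const char *xml =
            "<define_packet> ... </define_packet>"                    /* as in Listing 1 */
            "<define_packet_sequence> ... </define_packet_sequence>";
        struct sockaddr_in sin;
        int s;

        if ((s = socket(AF_INET, SOCK_STREAM, 0)) < 0)
                err(1, "socket");

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(9999);                          /* hypothetical pdroned port */
        sin.sin_addr.s_addr = inet_addr("192.168.40.15");    /* a drone host (example) */

        if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
                err(1, "connect");

        /* Blocks are processed by pdroned in the order they are sent.
         * A real client would loop on partial writes. */
        if (write(s, xml, strlen(xml)) < 0)
                err(1, "write");

        close(s);
        return (0);
}

pdroneControl wraps this same control channel in a graphical interface, as described next.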
The purpose of pdroneControl is to allow a user to easily create packets and packet sequences while maintaining control of all remote drones in an organized fashion. The highest-level graphical interface of pdroneControl is shown in Figure 6. In Figure 6, there are four defined drones. The columns to the right of the drone names control all of the functionality available to each drone. On the far right of the screen the packet (top right) and sequence (bottom right) libraries are displayed. Each user-defined packet and sequence is shown. In Figure 6, there are four defined packets and two defined sequences.

Figure 6: pdroneControl High-Level Interface

The most essential feature of pdroneControl is its extremely simple packet creation process. Packets can be specified in great detail, and while the work presented herein uses only UDP packets, the interface was constructed in a way that makes it extensible to packets of many other types. Several pull-down menus allow the user to choose different protocols and payload types. Figure 7 shows the high-level packet factory design. The final major high-level feature of pdroneControl is its ability to build packet sequences. Since the blasting of packet sequences was a main concern in the creation of a testing tool, it was essential that pdroneControl include a simple sequence factory. Figure 8 shows the straightforward sequence factory interface.

Figure 7: pdroneControl Packet Factory GUI

Figure 8: pdroneControl Sequence Factory GUI

The low-level design of pdroneControl further exposes its extensibility. Internally, each packet is represented as a dynamic list of header objects together with a data object. All header types are subclasses of the Header class, which can easily be extended to add additional, not-yet-implemented header types. In keeping with implementation simplicity, the low-level design of sequences merely encapsulates a dynamic list of packets and a name. Figures 9 and 10 depict UML class diagrams for pdroneControl.

Figure 9: UML Class Diagram of pdroneControl Packet

Figure 10: UML Class Diagram of pdroneControl Packet-Sequence

Measuring Status Quo FreeBSD Packet Processing Performance

The previously described packet generating solution (pdroned+pdroneControl) was used to evaluate the status quo packet processing performance of the FreeBSD kernel. The hardware used to perform the measurements was that described in Challenges in Measuring Network Packet Processing Performance above. Specifically, the three-node Zoo test cluster was used in the topology configuration shown in Figure 3. The tiger1 and tiger3 machines were configured as pdroned packet sources, whereas the tiger2 machine acted as the traffic sink and was used to test different FreeBSD installations and kernel options. The metrics collected on the tiger2 machine were the number of input packets processed per second by the gigabit Ethernet controllers' interrupt service threads, the number of gigabit Ethernet interface input errors per second flagged by the interrupt service threads during input processing, and the number of packets output per second via the gigabit Ethernet controllers. The tiger1 and tiger3 packet sources were instructed via pdroned+pdroneControl to blast small (52-byte) UDP packets to UDP port 7, with the destination corresponding to the addresses held by the tiger2 machine.
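For reference, and assuming the 52-byte figure counts everything from the Ethernet header through the raw payload of a frame built as in Listing 1, the size breaks down as:

14 bytes (Ethernet header) + 20 bytes (IPv4 header) + 8 bytes (UDP header) + 10 bytes (payload) = 52 bytes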
Since the tiger2 machine was directly connected to the packet source machines, the packet blasting forced tiger2 to process as many packets as possible originating from two independent gigabit Ethernet controllers. Additionally, tiger2 was configured to respond on UDP port 7 with the UDP Echo service handled by the inetd daemon shipped by default with FreeBSD. Care was taken to ensure that no rate limiting was performed by the tiger2 machine so as to not affect results.

Both the bleeding-edge development version FreeBSD 6.0-CURRENT as of April 2005 (the HEAD branch) and the legacy FreeBSD 4.10-RELEASE kernels were evaluated with respect to the aforementioned metrics. In the case of FreeBSD 6.0-CURRENT, four different kernels were deployed on tiger2:

(1) A GENERIC kernel (shipped by default with FreeBSD 6.0-CURRENT as of April 5th, 2005) without any debugging options. This kernel was named VANILLA. It has support for multi-processor SMP systems;

(2) The VANILLA kernel with the PREEMPTION option removed (PREEMPTION controls full preemption in scheduling within the FreeBSD kernel);

(3) The VANILLA kernel with the SMP option removed (a uni-processor kernel);

(4) The VANILLA kernel with the default SCHED_4BSD scheduler replaced with the experimental ULE scheduler available in FreeBSD 5.x and later.

Additionally, two FreeBSD 4.10-RELEASE kernels were deployed on tiger2:

(1) A GENERIC kernel (shipped by default with FreeBSD 4.10-RELEASE). This kernel does not have support for multi-processor SMP systems and was used unchanged, as shipped with 4.10-RELEASE;

(2) The GENERIC kernel of 4.10-RELEASE with the SMP multi-processor option added.

For each FreeBSD 6.0-CURRENT kernel, two boot-time configuration options (Hyperthreading and mpsafenet) were toggled and combined to yield four resulting data sets. Each data set was then plotted with respect to the three aforementioned metrics for each kernel configuration described above. The results are shown in Figures 11 through 25.

Figure 11: VANILLA kernel, Input Processing (6.0-CURRENT)
Figure 12: VANILLA kernel, Input Iface Errors (6.0-CURRENT)
Figure 13: VANILLA kernel, Output Processing (6.0-CURRENT)
Figure 14: VANILLA no PREEMPTION, Input Processing (6.0-CURRENT)
Figure 15: VANILLA no PREEMPTION, Input Iface Errors (6.0-CURRENT)
Figure 16: VANILLA no PREEMPTION, Output Processing (6.0-CURRENT)
Figure 17: VANILLA no SMP, Input Processing (6.0-CURRENT)
Figure 18: VANILLA no SMP, Input Iface Errors (6.0-CURRENT)
Figure 19: VANILLA no SMP, Output Processing (6.0-CURRENT)
Figure 20: VANILLA w/ ULE sched, Input Processing (6.0-CURRENT)
Figure 21: VANILLA w/ ULE sched, Input Iface Errors (6.0-CURRENT)
Figure 22: VANILLA w/ ULE sched, Output Processing (6.0-CURRENT)
Figure 23: GENERIC UP & SMP, Input Processing (4.10-RELEASE)
Figure 24: GENERIC UP & SMP, Input Iface Errors (4.10-RELEASE)
Figure 25: GENERIC UP & SMP, Output Processing (4.10-RELEASE)

Observation of Figures 11, 12, and 13 first reveals an inverse relationship between the Input Packet Processing and Output Packet Processing metrics. While the no-Hyperthreading (no HTT), mpsafenet=1 scenario performs best according to the input processing numbers, it performs worst according to the output processing numbers. The reason for this is likely scheduling behavior.
In particular, it should be noted that the current default FreeBSD scheduler, known as SCHED_4BSD, is not Hyperthreading-aware, in that it does not recognize the difference between a logical and a physical processor for scheduling purposes. This implies that in a scenario where, for example, two threads are in the "RUN" state (i.e., require immediate use of processor execution time) and four processors are available – of which two are physical and two are HTT logical CPUs – the FreeBSD scheduler might in fact end up scheduling the two threads on the secondary logical processors instead of the two real physical ones. This leads to degraded processing performance when high-priority threads require processing time, as is the case for input packet processing.

The processing of output packets, on the other hand, is performed by relatively lower-priority threads originating from the user-level, inetd-launched Echo service. Worse input processing performance (as in the case of the Hyperthreading-enabled scenarios) implies that less CPU time is dedicated to the higher-priority interrupt threads responsible for input packets. This results in the lower-priority output packet processing threads obtaining more of the CPU and therefore increases the output processing numbers, at the expense of lower input processing.

Additionally noteworthy in Figure 12 is the relatively high number of input interface errors observed in the Hyperthreading-enabled, mpsafenet=1 scenario. The mpsafenet parameter controls whether the FreeBSD 6.0-CURRENT "Giant" mutual exclusion lock (an in-kernel lock used to temporarily protect large portions of the FreeBSD kernel code and data until such time as they are given proper finer-grained locking) is held over the entirety of network processing. Setting mpsafenet=1 causes the FreeBSD kernel to not require the "Giant" mutual exclusion lock for network processing and instead rely on smaller-scope mutual exclusion locks protecting individual in-kernel data structures, whereas mpsafenet=0 keeps "Giant" held over the network code. The finer-grained approach is believed to bring higher parallelism on multi-processor architectures when multiple threads are simultaneously executing the same code, as the probability of having to block on each smaller-scope mutual exclusion lock is theoretically reduced (see [Baldwin]).

In the HTT-enabled, mpsafenet=1 scenario, the two network interface interrupt threads are likely sub-optimally scheduled due to the scheduler's unawareness of Hyperthreading details, thereby explaining the lower input processing performance. In the same scenario, it also appears that there is less execution time for the FreeBSD netISR kernel thread, a special thread responsible for pushing input packets queued by the high-priority network controller interrupt threads through the TCP/IP network code. A less frequently executing netISR thread would cause the higher-priority interrupt threads to fill their packet interface queues, thereby causing them to flag input interface errors and drop more packets. Indeed, this is the situation observed in Figure 12.

The ineffectiveness of FreeBSD's scheduler with respect to a Hyperthreading-enabled multi-processor system's ability to process packets is further revealed in Figures 14, 15, and 16, all of which depict the performance of a preemption-disabled kernel.
Particularly noteworthy is that when higher parallelism (presumably caused by mpsafenet=1) is prevalent and Hyperthreading is enabled, input processing performance is worsened and input interface errors occur in higher volume. The side effect, however, is that the lower-priority packet outputting threads manage to obtain more CPU time and thereby appear to output more packets than in all other scenarios.

Figures 17, 18, and 19 suggest that in the case of a single-processor (UP) kernel, the effect of mpsafenet (and therefore of finer- or coarser-grained mutual exclusion locking) is insignificant.

Figures 20, 21, and 22 depict the influence of the experimental ULE scheduler found in FreeBSD 6.0-CURRENT. It should be noted that only two data sets are depicted in these figures. The ULE scheduler scenarios with mpsafenet=1 (and thus a higher level of parallelism) resulted in temporary and unpredictable freezing of the receiver machine (tiger2) during testing. The exact cause of the freezing and temporary unresponsiveness of tiger2 in these scenarios is yet to be determined, so the associated results are excluded from this paper. Nonetheless, it should be noted that in the case of the ULE scheduler and mpsafenet=0 (i.e., the use of the "Giant" mutual exclusion lock to synchronize access to the kernel network code and data), performance is approximately the same regardless of whether Hyperthreading is enabled. However, when Hyperthreading is enabled, it appears that, as with the default FreeBSD scheduler, the netISR thread obtains less overall CPU execution time, thereby causing input interface errors to occasionally occur.

Finally, Figures 23, 24, and 25 depict the performance of the legacy FreeBSD 4.10-RELEASE system. Particularly noteworthy is that while input processing performance in 4.10-RELEASE is the same regardless of whether a multi-processor or single-processor kernel is used, output processing is slightly better in the single-processor case. The legacy FreeBSD 4.10-RELEASE kernel uses temporary interrupt masking to ensure in-kernel synchronization, a technique which was never designed with multi-processor in-kernel scalability as a top priority (see [Lehey]). The interrupt masking additionally causes more priority inversion, thus giving rise to higher output processing numbers at the expense of slightly lower input processing due to the temporary but frequent masking of network interface interrupts.

Conclusions on Experimental Modifications to the FreeBSD Packet Processing Code Paths

Two experimental changes critically impacting the FreeBSD receive packet processing code paths were considered. The first change is referred to as the UMA critical sections optimization, whereas the second is a modification of an existing but highly experimental mechanism in FreeBSD 6.0-CURRENT involving the net.isr.enable run-time tunable system control variable. The first of the considered changes was successfully benchmarked with respect to the three metrics defined in the Measuring Status Quo FreeBSD Packet Processing Performance section of this paper. The second change causes a live-lock phenomenon to occur on the receiver (sink) host, tiger2, and so performance data was not successfully collected. Work is ongoing to eliminate the live-lock condition and some details are presented below.

In order to understand the potential impact of the UMA critical section optimization, some background knowledge is required.
The UMA subsystem is a general multi-processor-optimized memory allocator derived from the classic Slab memory allocator design (see [Bonwick]). The UMA allocator is found in FreeBSD 6.0-CURRENT and features small, higher-layer per-CPU caches in order to provide a fast path for common-case allocations. Simply put, if a memory buffer can be found in the executing CPU's private UMA cache at allocation time, it is allocated immediately. Otherwise, a global memory buffer cache needs to be checked, followed by a call into the lower-level Virtual Memory (VM) subsystem code should the global cache also be found empty. In the current FreeBSD 6.0-CURRENT implementation, per-CPU mutual exclusion locks protect the per-CPU UMA caches. This has the benefit of ensuring synchronized access to the per-CPU UMA cache should preemption occur at the time of access. The unfortunate side-effect of using per-CPU mutual exclusion locks is their poor performance, even in the common case where the lock is uncontested (see [McKenney] for a general alternative to mutual exclusion applicable to particular scenarios for this very reason). Currently, all network buffers (as well as most other kernel data structures and buffers) in FreeBSD 6.0-CURRENT are allocated using the UMA allocator.

The UMA critical section optimization replaces the UMA per-CPU cache mutual exclusion locks with a micro-optimized interrupt masking mechanism along with temporary CPU thread pin-down (this consists of asking the kernel scheduler not to migrate the currently executing thread from the currently executing CPU in the case of preemption). The argument for the change is that optimized interrupt masking and temporary CPU thread pin-down are operations significantly cheaper than mutual exclusion lock acquisitions in the common case (refer to [Hlusi] for a discussion of the relative cost of SMP-related operations on contemporary Intel hardware).
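To make the shape of the change concrete, the following hedged C sketch contrasts a mutex-protected per-CPU cache fast path with the critical-section variant, using the public FreeBSD kernel primitives named above (mtx_lock/mtx_unlock, critical_enter/critical_exit, sched_pin/sched_unpin). It is an illustration of the technique, not the actual uma_core.c diff; struct percpu_cache, cache_alloc_item(), and current_cpu_cache() are hypothetical stand-ins.

/*
 * Hedged sketch of the per-CPU cache allocation fast path; illustrative
 * only, not the actual FreeBSD UMA code.
 */

/* Status quo: each per-CPU cache is protected by its own mutex. */
static void *
cache_fastpath_mutex(struct percpu_cache *cache)
{
        void *item;

        mtx_lock(&cache->pc_mtx);       /* costs atomic ops even when uncontested */
        item = cache_alloc_item(cache); /* hypothetical bucket manipulation */
        mtx_unlock(&cache->pc_mtx);
        return (item);
}

/*
 * UMA critical section optimization: pin the thread to its CPU and briefly
 * disable preemption instead of taking a per-CPU mutex.
 */
static void *
cache_fastpath_critical(void)
{
        struct percpu_cache *cache;
        void *item;

        sched_pin();                    /* do not migrate off this CPU */
        critical_enter();               /* no preemption while touching the cache */
        cache = current_cpu_cache();    /* hypothetical per-CPU lookup */
        item = cache_alloc_item(cache);
        critical_exit();
        sched_unpin();
        return (item);
}

The trade, as the results below suggest, is that the brief preemption-disabled window also changes scheduling behavior rather than merely shaving lock overhead.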
Figures 26, 27, and 28 depict the performance of the UMA critical section optimization change and of the VANILLA 6.0-CURRENT kernel with respect to the three previously defined metrics. As depicted, input processing appears to be more regular (a smoother curve) with the UMA critical section optimization than without, although input performance is slightly worse. On the other hand, the UMA critical section optimization shows significantly fewer input interface errors as well as significantly higher (500% on average) output processing performance.

Due to the high sensitivity of the recorded metrics to slight nuances in scheduling behavior and the nature of the optimizations, the results are in fact not surprising. The UMA critical section optimization makes use of temporary preemption disabling so as to ensure that an incoming interrupt does not cause UMA per-CPU cache corruption during cache access. Although the argument for the change is that the net common-case cost of an optimized temporary preemption disable accompanied by a temporary CPU thread pin-down is lower than the common-case cost of a mutual exclusion lock acquisition, the effect reported by the metrics in Figures 26, 27, and 28 is actually due to the change's impact on scheduling behavior. The temporary preemption disable consists of interrupt masking, which effectively produces a short-term priority inversion in favor of the lower-priority thread holding the critical section. This is in fact what happens during the tests. Namely, the receiver (sink) is subjected to a very high interrupt load. The more interrupts the sink is able to process, the higher the input processing numbers will be, but so will the input interface errors; output processing numbers, meanwhile, will be lower. The reason for this inverse relationship is that heavy interrupt processing starves out available CPU time by preventing the CPUs from processing any lower-priority threads. In the case of the UMA critical section optimizations, the temporary priority inversions prevent the high number of incoming interrupts from monopolizing the processor, causing input performance to drop but output performance and the netISR thread's performance to increase (this explains the lower number of input interface errors).

Figure 26
Figure 27
Figure 28

Precisely which behavior is better for the purposes of mitigating DDoS depends on the overall design of the filter and firewall software. Should the filter or firewall code paths be run from the netISR thread or from user-level code, the UMA critical section optimization is favorable. Should the filtering code paths be executed solely from low-level interrupt handlers in the firewall's design, then any change increasing priority inversions should generally be avoided.

The second experimental change to the packet processing code paths in FreeBSD involved a modification of the existing FreeBSD 6.0-CURRENT net.isr.enable system control tunable. The default setting for the net.isr.enable tunable is zero. The default behavior of the FreeBSD network interface receiver path is shown in Figure 29.

Figure 29: FreeBSD Receiver Packet Processing Paths

As depicted, the default behavior of the FreeBSD 6.0-CURRENT kernel is to add a packet requiring stack processing to an appropriate interface queue and schedule the netISR kernel thread to perform the additional processing. The netISR thread is typically launched via a context switch from the kernel scheduler should it be the next highest-priority thread on the run-queue. When the net.isr.enable tunable is set to 1, the netISR processing is instead dispatched directly from the network interface receiver interrupt thread, so that a return to the scheduler is not performed immediately.

An attempt was made to measure the performance effects of this behavior with respect to the three defined metrics, but it failed. Setting net.isr.enable to 1 in conjunction with the high packet rates and associated high interrupt load causes complete live-lock on the receiver (sink) system. The now longer-executing interrupt threads completely starve out all available processor time and prevent any lower-priority threads from executing. Further attempts must be made to properly gauge the resulting input packet processing numbers and the benefits that the net.isr.enable=1 behavior could bring to a DDoS-mitigating solution; however, that extended analysis is beyond the scope of this paper.
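For reference, the dispatch difference described above has roughly the following shape. This is a hedged C sketch written in the style of the historic FreeBSD netisr interface (IF_ENQUEUE(), schednetisr(), ip_input()), not the exact 6.0-CURRENT code; net_isr_enable stands in for the value of the net.isr.enable tunable.

/*
 * Hedged sketch of the receive-path dispatch choice controlled by
 * net.isr.enable; illustrative only.
 */
static void
receive_dispatch(struct mbuf *m)
{
        if (net_isr_enable) {
                /*
                 * net.isr.enable=1: run the protocol input routine directly
                 * in the context of the interface's interrupt thread.  No
                 * immediate return to the scheduler, but the interrupt
                 * thread now runs longer at high priority.
                 */
                ip_input(m);
        } else {
                /*
                 * Default (net.isr.enable=0): queue the packet and let the
                 * lower-priority netISR kernel thread push it through the
                 * TCP/IP code when the scheduler next runs it.
                 */
                IF_ENQUEUE(&ipintrq, m);
                schednetisr(NETISR_IP);
        }
}

With the default path, the interrupt thread returns quickly and the scheduler decides when the netISR thread runs; with direct dispatch, the interrupt thread itself executes the stack code, which is consistent with the live-lock observed above under very high interrupt load.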
Acknowledgments

The authors would like to thank their supervisor, Anton Vinokurov, for providing space and accommodation for TESLA, a FreeBSD 6.0-CURRENT build machine used for pdroned+pdroneControl development and testing, as well as for various preliminary kernel testing. Anton's experience and advice were invaluable in the development and realization of this project. The authors would also like to thank the FreeBSD Project, and in particular Robert Watson and George Neville-Neil, for their help and advice regarding the FreeBSD Zoo Test Cluster used for the various performance measurements whose results are included in this paper. Also deserving a mention are all the donors of the FreeBSD Project Zoo Test Cluster hardware (see [ZooCluster]). Robert Watson is also responsible for the UMA critical section change patches and their continued maintenance. Last but not least, the Sandvine (http://www.sandvine.com/) engineers merit a thank you for their help with developing and maintaining the ng_source FreeBSD kernel module in a usable state.

Data & Source Code

The source code to pdroned+pdroneControl used to generate the data in this paper, as well as the raw data and an electronic (PDF) copy of this paper, can be found at the following HTTP address: http://people.freebsd.org/~bmilekic/quantifying_ddos/

References

[Mirkovic 2003] J. Mirkovic, G. Prier and P. Reiher, Challenges of Source-End DDoS Defense, Proceedings of the 2nd IEEE International Symposium on Network Computing and Applications, April 2003.

[UCLATr020018] J. Mirkovic, J. Martin and P. Reiher, A Taxonomy of DDoS Attacks and DDoS Defense Mechanisms, UCLA CSD Technical Report CSD-TR-020018.

[DWARD] J. Mirkovic, D-WARD: Source-End Defense Against Distributed Denial-of-Service Attacks, Ph.D. Thesis, UCLA, August 2003.

[FreeBSD] [WEB] The FreeBSD Project website. URL: http://www.FreeBSD.org/

[Netperf] [WEB] The Public Netperf Performance Tool website. URL: http://www.netperf.org/netperf/NetperfPage.html

[NetPIPE] [WEB] A Network Protocol Independent Performance Evaluator: NetPIPE Project website. URL: http://www.scl.ameslab.gov/netpipe/

[ttcp] [WEB] PCAUSA Test TCP Benchmarking Tool for Measuring TCP and UDP Performance (Project website). URL: http://www.pcausa.com/Utilities/pcattcp.htm

[ZooCluster] [WEB] The Zoo Test Cluster News and Updates website. URL: http://zoo.unixdaemons.com/

[Trinoo] [WEB] David Dittrich. The DoS Project's "Trinoo" Distributed Denial of Service Attack Tool. University of Washington public document. October 1999. URL: http://staff.washington.edu/dittrich/misc/trinoo.analysis.txt

[TFN] [WEB] David Dittrich. The "Tribe Flood Network" Distributed Denial of Service Attack Tool. University of Washington public document. October 1999. URL: http://staff.washington.edu/dittrich/misc/tfn.analysis.txt

[DDoSTools] [WEB] Advanced Networking Management Lab Distributed Denial of Service Attack Tools. Public document from Pervasive Technology Labs at Indiana University. URL: http://www.anml.iu.edu/ddos/tools.html

[Netstat] [WEB] The netstat(1) FreeBSD Manual Page (various authors). Available online. URL: http://snowhite.cis.uoguelph.ca/course_info/27420/netstat.html

[Baldwin] John H. Baldwin, Locking in the Multithreaded FreeBSD Kernel, Proceedings of the USENIX BSDCon 2002 Conference, February 2002.

[DI-44BSD] Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, John S. Quarterman, The Design and Implementation of the 4.4BSD Operating System, Addison-Wesley Longman, Inc., 1996.

[DI-FreeBSD] Marshall Kirk McKusick, George Neville-Neil, The Design and Implementation of the FreeBSD Operating System, Addison-Wesley Professional, 2004.

[DDoSChain] Yun Huang, Xianjun Geng, Andrew B. Whinston, Defeating DDoS Attacks by Fixing the Incentive Chain, University of Texas public document, April 2003. URL: http://crec.mccombs.utexas.edu/works/articles/DDoS2005.pdf

[FTIMES] Chris Nuttall. The Financial Times. November 12, 2003. Front Page, First Section.

[Intel] [WEB] The Intel Public Website. URL: http://www.intel.com/

[IntelTPS] Various authors. The Intel® SE7501WV2 Server Board Technical Product Specification (TPS). Intel Public Documentation. December 2003. URL: ftp://download.intel.com/support/motherboards/server/se7501wv2/tps.pdf
[Netgraph] [WEB] The netgraph(4) FreeBSD Manual Page (various authors). Available online. URL: http://www.elischer.org/netgraph/man/netgraph.4.html

[PCIX] [WEB] Various authors. PCI-X 2.0: High-Performance, Backward-Compatible PCI for the Future. URL: http://www.pcisig.com/specifications/pcix_20/

[ngsource] [WEB] The ng_source(4) FreeBSD Manual Page (various authors). Available online. URL: http://www.freebsd.org/cgi/man.cgi?query=ng_source&apropos=0&sektion=0&manpath=FreeBSD+6.0-current&format=html

[Bonwick] Jeff Bonwick, Jonathan Adams. Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources. Proceedings of the USENIX 2001 Annual Technical Conference. June 2001.

[McKenney] Paul E. McKenney, Kernel Korner – Using RCU in the Linux 2.5 Kernel. Linux Journal, September 2003. Available online. URL: http://linuxjournal.com/article/6993

[McKenney2] Paul E. McKenney, Read-Copy-Update: Using Execution History to Solve Concurrency Problems, Unpublished. URL: http://www.rdrop.com/users/paulmck/rclock/intro/rclock_intro.html

[Lehey] Greg Lehey, Improving the FreeBSD SMP Implementation, Proceedings of the USENIX 2001 Annual Technical Conference: Freenix Track, 2001.

[Hlusi] Jiri Hlusi, Symmetric Multi-Processing (SMP) Systems On Top of Contemporary Intel Appliances, University of Tampere, Department of Computer and Information Sciences, Pro gradu, December 2002.

[Juniper] [WEB] Juniper Networks public website. URL: http://www.juniper.net/

[82546EB] [WEB] Various authors. Intel 82546EB Gigabit Ethernet Controller Specification Update. Available online. URL: http://www.intel.com/design/network/specupdt/82546eb.htm