ABHISHEK KATULURU
ARUN KUMAR LOKRE
SUDHEER VASANTHAM
YOUSUF
SANTOSH KALAKONDA
SDN BASED HARDWARE ACCELERATED FIREWALL
Fig: The spread of the Sapphire worm in the 30 minutes after its release (Src: http://www.caida.org)
Computer worms such as Sapphire use the Internet to spread rapidly, infecting millions of hosts and causing billions of dollars in losses to various organizations. With the increase in Internet bandwidth, this problem is even more aggravated today.
We have implemented an SDN-based hardware-accelerated firewall that can mitigate the effects of these aggressively spreading worms. Our firewall design has hardware engines that support Deep Packet Inspection and allow dynamic updates from a remote controller using special packets. This firewall has a very low update latency compared to a traditional OpenFlow switch on the NetFPGA and can maintain high throughput while performing Deep Packet Inspection.
We have implemented a dual-core, dual-threaded processor with a look-up and re-routing hardware accelerator. Our processor is based on a RISC architecture with a 4-stage pipeline and early branch resolution. Packets are classified into normal packets, patterned packets, and instruction packets. Normal packets are processed and routed to their destination. Two lists (an Allow list and a Deny list) are maintained to process patterned packets. If a patterned packet is in the Allow list, the processor routes it to its destination. If it is in the Deny list, the processor drops it. If a packet is in neither list, it is re-routed to the control node for a decision, and the control node then sends an instruction packet to update both lists.
The NetFPGA is a low-cost platform, primarily designed as a tool for teaching networking hardware and router design. It has also proved to be a useful tool for networking researchers. Through partnerships and donations from sponsors of the project, the NetFPGA is widely available to students, teachers, researchers, and anyone else interested in experimenting with new ideas in high-speed networking hardware. The NetFPGA platform contains one large Xilinx Virtex-II Pro 50 FPGA, which is programmed with user-defined logic and has a core clock running at 125 MHz. The platform also contains one small Xilinx Spartan II FPGA holding the control logic for the PCI interface to the host processor. Two 18-Mbit external Cypress SRAMs are arranged as 512K words x 36 bits (4.5 MBytes total) and operate synchronously with the FPGA logic at 125 MHz. One bank of external Micron DDR2 SDRAM is arranged as 16M words x 32 bits (64 MBytes total). Using both edges of a separate 200 MHz clock, the memory has a bandwidth of 400 MWords/s (1,600 MBytes/s = 12,800 Mbits/s).
Block diagram of NetFPGA

Specifications of NetFPGA-1G:

Field Programmable Gate Array (FPGA) Logic
• Xilinx Virtex-II Pro 50
• 53,136 logic cells
• 4,176 Kbit block RAM
• Up to 738 Kbit distributed RAM
• 2 x PowerPC cores
• Fully programmable by the user

Gigabit Ethernet networking ports
• Connector block on left of PCB interfaces to 4 external RJ45 plugs
• Interfaces with standard Cat5E or Cat6 copper network cables using a Broadcom PHY
• Wire-speed processing on all ports at all times using FPGA logic
CPU ARCHITECTURE
The CPU architecture is based on a 4-stage pipeline. We optimized the design by merging the ID and EX stages into one, reducing the pipeline by one stage; this reduces dependencies and, eventually, NOPs, helping performance.

Stage-wise description
IF stage:
Instruction Fetch is the first stage of the pipeline, in which the processor retrieves a program instruction from its instruction memory. There are two instruction memories, two program counters, and two PC incrementers, one set for each thread. Each thread is given its own thread ID, and the instruction to be carried to the Instruction Decode stage is selected by muxes using this thread ID as the select line. After each instruction, the next thread ID is assigned to the muxes by the thread scheduler.
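A minimal software model of this interleaved fetch (names are illustrative, not taken from the RTL; a 2-entry PC array stands in for the two program counters, and the XOR implements the round-robin thread scheduler):

```c
#include <stdint.h>

/* Two program counters and a round-robin thread scheduler,
   mirroring the dual-threaded IF stage described above. */
typedef struct {
    uint16_t pc[2];   /* one program counter per thread */
    int      tid;     /* thread ID driving the mux select lines */
} if_stage_t;

/* One fetch cycle: select the active thread's PC, "fetch" from that
   thread's instruction memory, increment its PC, then let the
   scheduler hand the next thread ID to the muxes. */
uint16_t if_stage_fetch(if_stage_t *s, int *fetched_tid) {
    *fetched_tid = s->tid;
    uint16_t fetch_addr = s->pc[s->tid];
    s->pc[s->tid] = (uint16_t)(fetch_addr + 1); /* PC incrementer */
    s->tid ^= 1;                                /* round-robin scheduler */
    return fetch_addr;
}
```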
ID/EX stage:
We merged the ID and EX stages into a single stage for performance improvement; the ISA was designed so that the ID stage can be merged with the EX stage. Depending on the type of the instruction and the thread ID, the corresponding register file is accessed and its contents are given to the ALU, which performs the arithmetic or logical operation on them.
MEM stage:
The FIFO memory access is performed for load and store instructions. The data address for loads and stores is calculated by the ALU; address calculation is governed by the control signals produced by decoding the instruction.
WB stage:
For register-to-register instructions, the instruction result is written back to the register file in the WB stage.
We used a 32-bit instruction format and a 64-bit datapath for our design.
The ALU has a CSA adder (performing the addition, subtraction, and SLT operations) along with AND, OR, shift, and XNOR units.
The instruction memory has 512 locations and the data memory has 256 locations.
Early branch design
Initially we used an RCA, but we observed significant combinational delay during synthesis, so we moved to a CSA, which enhanced the design.
R Type:
| Rt (31-28) | Rs (27-24) | Rd (23-20) | Shift MSB (19-16) | S2,S1,S0 (15-13) | Shift LSB (12-11) | Bits for future use (11-5) | Reserved bits for power (2-4) | Opcode (1-0) |

LW/SW Type:
| Rt (31-28) | Rs (27-24) | --- | Offset (23-8) | ADDI (7) | SUBI (6) | --- (5) | Reserved bits for power (2-4) | Opcode (1-0) |

J Type:
| Rt (31-28) | Rs (27-24) | Rd (23-20) | Jump (19) | JandL (18) | Link (17) | JR (16) | Beq (15) | BNE (14) | Jump Address (13-5) | Reserved bits for power (2-4) | Opcode (1-0) |
| Instruction | Source 1 | Source 2 | Destination |
| ADD | Rs | Rt | Rd |
| SUB | Rs | Rt | Rd |
| OR | Rs | Rt | Rd |
| AND | Rs | Rt | Rd |
| SHIFT LEFT | Rs | - | Rd |
| SHIFT RIGHT | Rs | - | Rd |
| XNOR | Rs | Rt | Rd |
| Lw | K(Rs) | - | Rt |
| Sw | K(Rs) | Rt | - |
| ADDI | Rs | Immediate value | Rt |
| SUBI | Rs | Immediate value | Rt |
| J (direct), J&L, Link | - | - | - |
| JR (Jump register) | Rs | - | - |
| BEQ | Rs | Rt | - |
| BNE | Rs | Rt | - |
A state machine is a mathematical model of computation used to design both computer programs and sequential logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states. The machine is in only one state at a time; the state it is in at any given time is called the current state. It changes from one state to another when initiated by a triggering event or condition; this is called a transition. A particular FSM is defined by a list of its states and the triggering condition for each transition.
The above state machine is instantiated for both cores, and the four control signals CPU_BUSY_CORE1, FIFO_BUSY_CORE1, CPU_BUSY_CORE2, and FIFO_BUSY_CORE2 are used by another, higher-level state machine, which performs a round-robin style of arbitration to dispatch packets to the idle processor. Having two cores controlled by an arbiter helps maintain the throughput. As the time required to process a packet in the CPU
increases, we may have to increase the number of cores in the design to maintain line speed. But this is not easy, as we are constrained by the number of slices that can be used on the NetFPGA. Keeping this in mind, we optimized our processor cores to accommodate two cores, two hardware accelerators (one for each processor), and the other interface logic within the available slices without compromising the timing requirements.
The other interesting aspect of our project is the design of the memory-mapped I/O module, which translates all the address locations of software and hardware registers into address locations easily accessible to the CPU. Only SW and LW instructions are used to set up data for the hardware accelerators or to store computed results from a hardware accelerator back to memory. One advantage of using this module is that we need not create new instructions to perform operations related to a hardware accelerator.
We have designed a special FIFO that allows data to flow between the NetFPGA reference pipeline and our processor. A control interface lets the FIFO act either as simple data memory, which the processor can use to modify the data, or as a standard FIFO that accepts an incoming packet and buffers it for later processing by the CPU. The FIFO is built from Block RAM modules instantiated as dual-port SRAMs. The SRAM module has two ports, A and B: port A (posedge-triggered) writes data into the memory, while port B (negedge-triggered) reads data (the packet) out of the FIFO. One address/data port pair serves as the input interface to the FIFO while the other serves as the output, and during CPU processing we have access to the head and tail addresses of the packet. The FIFO, along with its supporting state machine, performs the following operations: (1) it buffers all of the data and control information for one network packet; (2) once a packet is buffered, it sends a control signal to the CPU indicating that the packet has been received and is ready for processing (during this time the next packet is sent to the other core, which helps keep the throughput close to the link rate); (3) the processor has read and write access to the head and tail address registers as well as full access to the SRAM that holds the data. A special signal called "user mode" selects between data from the CPU's MEM stage in the pipeline and the SW/HW registers.
1) On reset, FIFO_BUSY is activated; we wait for a packet to arrive while simultaneously sending out the previously processed packet. Once we have finished sending the previous packet and received the current packet, control shifts to the CPU_BUSY state.
2) In the CPU_BUSY state, we run the two threads in the CPU. Once the threads have finished their work, the CPU_DONE signal is activated, which returns the machine to the FIFO_BUSY state to receive another packet.
As can be seen from the figure above, we use only two states to run the FIFO. By using a single state, FIFO_BUSY, for both receiving and sending a packet, we were able to improve the throughput significantly. Compared with the conventional four-state machine, the theoretical improvement ranges from 0% to 50%. For example, when the duration of an incoming packet and the time to process a packet are the same, the improvement is about 30%.
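That estimate can be checked with a back-of-the-envelope model (a sketch, assuming receive, process, and send each take the same time T; with equal durations it yields about 33%, in line with the roughly 30% quoted above):

```c
/* Throughput gain of the two-state FIFO controller over a
   conventional four-state one. In the four-state machine, receive,
   process, and send happen sequentially; in the two-state machine,
   sending the previous packet overlaps with receiving the next one
   inside FIFO_BUSY. Returns the fractional reduction in cycle time. */
double fsm_improvement(double t_recv, double t_proc, double t_send) {
    double four_state = t_recv + t_proc + t_send;
    double t_io = (t_recv > t_send) ? t_recv : t_send; /* overlapped I/O */
    double two_state = t_io + t_proc;
    return (four_state - two_state) / four_state;
}
```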
A network intrusion detection system (NIDS) monitors traffic on a network, looking for suspicious activity such as an attack or unauthorized access, by detecting malicious patterns on the network in order to maintain data and file integrity.
Objective: To achieve an extendible 20-pattern match in a Network Intrusion Detection System.
The implementation of this system contains the following modules:
• Input Queue
• Input arbitration
• Routing decision and packet modification
• Output queuing
• Output
Overview
In the input queue stage, as data comes in, the input arbiter allocates the bus based on the priorities of the input queues.
Once a packet gets access to the bus, the output port lookup module sets the destination port metadata field for the packet and then forwards it to the output queues.
The NIDS module is inserted between the output port lookup module and the output queues; since the design is pipelined, inserting the NIDS module does not compromise the throughput.
The NIDS receives packets from the output port lookup module every clock cycle and checks them for the patterns. If there is a match, the packet is dropped and does not reach the output queues.
A proper handshaking mechanism is implemented between the modules so that packets are not dropped unnecessarily because of improper synchronization.
The algorithm uses exhaustive search, in which every 56-bit alignment of the incoming data is hashed and compared with the key of the pattern. The design uses two hash functions to detect a pattern, and a match is generated only if both hash functions generate a match signal.
Equality of the pattern with substrings of the text is tested using a hash function. A hash function converts every string into a numeric value, called its hash value; if two strings are equal, their hash values are also equal. Thus it would seem that all we have to do is compute the hash value of the substring we are searching for and then look for substrings with the same hash value.
For the first key, 16 groups of 4 random bits are hashed (using combinational logic) into a 16-bit key. This operation is performed on both the input data and the pattern to be matched. The keys obtained from the data and the patterns are compared to produce the Match 1 signal.
For the second key, 16 random bits (as shown below) are selected from the 56 bits to generate the hash key. This operation is again performed on both the input data and the pattern to be matched.
The keys obtained from the data and the pattern are compared to produce the Match 2 signal.
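A software sketch of the dual-hash matcher described above (the two hash functions below are illustrative stand-ins, since the exact bit selections are not specified here; only the structure follows the design: hash every 56-bit alignment and require both keys to agree):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key #1: fold the 56-bit window's fourteen 4-bit
   nibbles into a 16-bit key (stand-in for the combinational hash). */
static uint16_t hash1(uint64_t w) {
    uint16_t key = 0;
    for (int i = 0; i < 14; i++)  /* 14 nibbles = 56 bits */
        key = (uint16_t)(((key << 4) | (key >> 12)) ^ ((w >> (4 * i)) & 0xF));
    return key;
}

/* Hypothetical key #2: pick 16 fixed "random" bit positions. */
static const int taps[16] = {1,4,7,10,14,19,23,28,31,36,40,44,47,50,53,55};
static uint16_t hash2(uint64_t w) {
    uint16_t key = 0;
    for (int i = 0; i < 16; i++)
        key |= (uint16_t)(((w >> taps[i]) & 1) << i);
    return key;
}

/* Exhaustive search: hash every 56-bit (7-byte) alignment of the
   input and flag a match only when BOTH keys agree with the
   pattern's keys. */
int nids_match(const uint8_t *data, size_t len, uint64_t pattern) {
    uint16_t k1 = hash1(pattern), k2 = hash2(pattern);
    for (size_t i = 0; i + 7 <= len; i++) {
        uint64_t w = 0;
        for (int b = 0; b < 7; b++)
            w = (w << 8) | data[i + b];   /* next 56-bit window */
        if (hash1(w) == k1 && hash2(w) == k2)
            return 1;   /* match signal: drop the packet */
    }
    return 0;
}
```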
Look-Up Hardware Accelerator
The main requirement for this hardware accelerator is to differentiate between normal packets and instruction packets. As a packet traverses the design, it travels through parse logic, where it is parsed and passed to the matcher circuit. In the matcher circuit, the packet is compared against patterns predefined using 20 software registers. The behavior then depends on the following scenarios:
Scenario 1: If there is no match, the packet is processed in the regular fashion and forwarded through the design.
Scenario 2: If there is a match, the packet is forwarded to further hardware: CAMs maintaining allowed and denied lists of IP addresses (configurable). If the packet is found in the allowed list, the corresponding action in the action table is taken, which is to allow the packet for processing; if the packet is found in the denied list, the corresponding action from the action table, dropping the packet, is initiated.
Scenario 3: If there is a match but the respective IP addresses are found in neither the allowed list nor the denied list, the packet is rerouted to the controller, where it is processed and sent back to the network as an instruction packet.
IP HEADER CHECKSUM
This hardware accelerator reads the header out of the data memory, calculates its checksum, and stores it back into the header checksum field of the data. This hardware accelerator can be used in many other network applications.
The header checksum is calculated over 4 words in the header, which include the option field, the sender IP field, and the receiver IP field. An example of the header checksum calculation is shown below.
Take the following truncated excerpt of an IP packet.
The header is shown in bold and the checksum is underlined.
4500 0073 0000 4000 4011 b861 c0a8 0001 c0a8 00c7 0035 e97c 005f 279f 1e4b 8180
To calculate the checksum, we first calculate the sum of each 16-bit value within the header, skipping only the checksum field itself. Note that the values are in hexadecimal notation: 4500 + 0073 + 0000 + 4000 + 4011 + c0a8 + 0001 + c0a8 + 00c7 = 2479C (equivalent to 149,404 in decimal).
Next, we convert the value 2479C to binary: 0010 0100 0111 1001 1100. The first nibble (4 bits) is the carry, which is added back to the rest of the value: 0010 + 0100 0111 1001 1100 = 0100 0111 1001 1110 (0x479E). Finally, we take the one's complement: NOT 0x479E = 0xB861, which is the checksum value underlined in the header above.
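The same calculation can be written as a standard IPv4 checksum routine (a sketch over the full 10-word header, matching the worked example above; the accelerator itself operates on the 4 words it is given):

```c
#include <stddef.h>
#include <stdint.h>

/* IPv4 header checksum over 16-bit words: sum the words with
   end-around carry, then take the one's complement. The checksum
   field itself (word index 5) is treated as zero during the sum. */
uint16_t ip_header_checksum(const uint16_t *words, size_t nwords) {
    uint32_t sum = 0;
    for (size_t i = 0; i < nwords; i++) {
        if (i == 5) continue;      /* skip the checksum field */
        sum += words[i];
    }
    while (sum >> 16)              /* fold carries back into the sum */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;         /* one's complement */
}
```

Running this over the example header reproduces the underlined checksum, 0xB861.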
A Perl script, 'hwreg', is used to initialize designs on the NetFPGA and enable packet processing in the CPU.
A top-level shell script such as the following can be used to automate programming of the bitfile (the compiled NetFPGA design) and enable the CPU:

# initialize NetFPGA: download the bitfile and start the control daemon
sudo killall rkd
nf_download /path/to/netfpga_design.bit
/usr/local/netfpga/projects/router_kit/sw/rkd &
sleep 1s

# set control node's IP address, in this case 10.1.3.3
hwreg setcontrol 0x0a010303

# Program Core 1, Thread 1
hwreg program core1 t1 /path/to/bitfile.core1_thread1.s
# Program Core 1, Thread 2
hwreg program core1 t2 /path/to/bitfile.core1_thread2.s
sleep 1s

# Program Core 2, Thread 1
hwreg program core2 t1 /path/to/bitfile.core2_thread1.s
# Program Core 2, Thread 2
hwreg program core2 t2 /path/to/bitfile.core2_thread2.s
sleep 1s

# set user mode to enable CPU processing
hwreg usermode
Summary of 'hwreg' options:

| Command | Description |
| program core<1/2> t<1/2> <bitfile> | Program bitfile for core# & thread#. |
| verify core<1/2> t<1/2> <bitfile> | Verify programmed bitfile for core# & thread#. |
| setcontrol <control_ip> | Set control node's IP. Eg: 0x0a010203 for 10.1.2.3. |
| usermode | Enable packet processing in CPU. |
| help | Print help message. |

Memory Mapped Data Locations

| Memory addresses | Read / Write | Description |
| 0x000 - 0x0FF | Read & write | Single packet data. Each packet is stored in a FIFO in the given range to be processed by the CPU. Dedicated for each core. |
| 0x100 - 0x17F | Read & write | Scratch space for CPU thread 1. Scratch memory for CPU usage for thread 1. Dedicated for each core. |
| 0x180 - 0x1FF | Read & write | Scratch space for CPU thread 2. Scratch memory for CPU usage for thread 2. Dedicated for each core. |
| 0x481 | Read only | Packet head address. The address of the first data word of the current packet in the FIFO. Dedicated for each core. |
| 0x482 | Read only | Packet tail address. The address of the last data word of the current packet in the FIFO. Due to the nature of the circular FIFO, the tail address can be less than the head address; addresses beyond 0x0FF should not be considered part of the packet. Dedicated for each core. |
| 0x490 - 0x491 | Read only | Checksum hardware accelerator results. Two result words (64-bit each): the newly computed packet checksum based on changes made to the MAC/IP addresses in the packet. Dedicated for each core. |
| 0x490 - 0x493 | Write only | Checksum hardware accelerator data setup. Four 64-bit words taken from the packet, carrying the MAC, IP, PORT, and CHECKSUM data used to calculate the new checksum. Dedicated for each core. |
| 0x49F | Read only | Checksum hardware accelerator done signal. Read to check whether the checksum hardware accelerator is done calculating the new checksum. Dedicated for each core. |
| 0x49F | Write only | Checksum hardware accelerator start signal. Written to initiate the checksum calculation after the data (0x490-0x493) has been set up. Dedicated for each core. |
| 0x4A0 | Read only | Action table hardware accelerator's action result. Read the action needed for the current packet; the action is pre-computed before the CPU cycle. Dedicated for each core. |
| 0x4B0 | Read only | Easter egg. Returns hard-coded 0xCAFEBABE. Dedicated for each core. |

When tracing from the head address to the tail address, the following snippet should be used:

    while (i != tailaddr) {
        // insert user code here...
        i = i + 1;
        // 'i' cannot go beyond 255 (FIFO locations 0x000 - 0x0FF)
        if (i == 256) {
            i = 0;
        }
    }
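The head/tail wraparound rule can also be packaged as a small host-side helper for testing packet-traversal logic (packet_addrs is an illustrative name; it records the FIFO addresses a packet occupies, treating the tail word as part of the packet):

```c
/* Collect the FIFO addresses occupied by a packet, walking from the
   head address to the tail address and wrapping past 0x0FF back to
   0x000, as described for the packet tail address register above.
   Returns the number of words in the packet. */
int packet_addrs(int head, int tail, int *out) {
    int n = 0;
    int i = head;
    for (;;) {
        out[n++] = i;            /* record this packet word's address */
        if (i == tail)
            break;               /* tail word is included in the packet */
        i = (i + 1) & 0xFF;      /* wrap within FIFO locations 0x000-0x0FF */
    }
    return n;
}
```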
Performance Evaluation
OpenFlow switch implementation on the NetFPGA:
Testing packet: PING (size: 64 bytes), mode: UP. Flows achieved = 61K/s, so each update takes 16.4 us.
Our design: on average, the time required to process one ping packet is 0.1 us, so the number of ping packets received in 1 s is 10^7. Each packet can make one flow update, therefore the number of flow updates achieved is 10^7/s.
Performance improvement in terms of number of updates = 10^7 / 61K = 163.93X (firmware).
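The comparison above is just a ratio of update rates, which can be restated as a one-line helper (illustrative; the two rates are the measured figures quoted above):

```c
/* Flow-update speedup: our design's update rate divided by the
   reference OpenFlow switch's update rate (61K updates/s). */
double update_speedup(double ours_per_sec, double reference_per_sec) {
    return ours_per_sec / reference_per_sec;
}
```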
CAM Look-Up Hardware Accelerator

ALU

Core Design

Register File