ABHISHEK KATULURU
ARUN KUMAR LOKRE
SUDHEER VASANTHAM
YOUSUF
SANTOSH KALAKONDA
SDN BASED HARDWARE ACCELERATED FIREWALL
Fig: The spread of the Sapphire worm in the 30 minutes after its release (Src: http://www.caida.org)
Computer worms such as Sapphire use the Internet to spread rapidly, infecting millions of hosts and causing billions of dollars in losses to various organizations. With the increase in Internet bandwidth, this problem is even more aggravated today.
We have implemented an SDN-based hardware-accelerated firewall that can mitigate the effects of these aggressively spreading worms. Our firewall design has hardware engines that support Deep Packet Inspection and allow dynamic updates from a remote controller using special packets. This firewall has a very low update latency compared to a traditional OpenFlow switch on the NetFPGA and can maintain high throughput while performing Deep Packet Inspection.
We have implemented a dual-core, dual-threaded processor with a look-up and re-routing hardware accelerator. Our processor is based on a RISC architecture with a 4-stage pipeline and early branch resolution. Packets are classified into normal packets, patterned packets, and instruction packets. Normal packets are processed and routed to their destination. Two lists (an Allow list and a Deny list) are maintained to process patterned packets. If a patterned packet is in the Allow list, the processor routes it to its destination. If it is in the Deny list, the processor drops it. If a packet is in neither list, it is re-routed to the control node for a decision, and the control node then sends an instruction packet to update both lists.
The NetFPGA is a low-cost platform, primarily designed as a tool for teaching networking hardware and router design. It has also proved to be a useful tool for networking researchers. Through partnerships and donations from sponsors of the project, the NetFPGA is widely available to students, teachers, researchers, and anyone else interested in experimenting with new ideas in high-speed networking hardware. The NetFPGA platform contains one large Xilinx Virtex-II Pro 50 FPGA, which is programmed with user-defined logic and has a core clock running at 125 MHz. The platform also contains one small Xilinx Spartan II FPGA holding the control logic for the PCI interface to the host processor. Two 18-Mbit external Cypress SRAMs are arranged as 512K words x 36 bits (4.5 MBytes total) and operate synchronously with the FPGA logic at 125 MHz. One bank of external Micron DDR2 SDRAM is arranged as 16M words x 32 bits (64 MBytes total). Using both edges of a separate 200 MHz clock, the memory has a bandwidth of 400 MWords/s (1,600 MBytes/s = 12,800 Mbits/s).
Block diagram of NetFPGA

Specifications of NetFPGA-1G:

Field Programmable Gate Array (FPGA) Logic
• Xilinx Virtex-II Pro 50
• 53,136 logic cells
• 4,176 Kbit block RAM
• Up to 738 Kbit distributed RAM
• 2 x PowerPC cores
• Fully programmable by the user

Gigabit Ethernet networking ports
• Connector block on left of PCB interfaces to 4 external RJ45 plugs
• Interfaces with standard Cat5E or Cat6 copper network cables using a Broadcom PHY
• Wire-speed processing on all ports at all times using FPGA logic
CPU ARCHITECTURE
The CPU architecture is based on a 4-stage pipeline. We optimized the design by merging the ID and EX stages into one, reducing the pipeline by one stage; this reduces dependencies and, eventually, NOPs, helping performance.

Stage-wise description
IF stage:
Instruction Fetch is the first stage of the pipeline, in which the processor retrieves a program instruction from its instruction memory. There are two instruction memories, two program counters, and two PC incrementers, one set for each thread. Each thread is given its own thread ID, and the instruction to be carried to the Instruction Decode stage is selected by muxes using this thread ID as the select line. After each instruction, the next thread ID is assigned to the muxes by the thread scheduler.
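A minimal software model of this interleaved fetch (names are illustrative, not taken from the RTL; a 2-entry PC array stands in for the two program counters, and the XOR implements the round-robin thread scheduler):

```c
#include <stdint.h>

/* Two program counters and a round-robin thread scheduler,
   mirroring the dual-threaded IF stage described above. */
typedef struct {
    uint16_t pc[2];   /* one program counter per thread */
    int      tid;     /* thread ID driving the mux select lines */
} if_stage_t;

/* One fetch cycle: select the active thread's PC, "fetch" from that
   thread's instruction memory, increment its PC, then let the
   scheduler hand the next thread ID to the muxes. */
uint16_t if_stage_fetch(if_stage_t *s, int *fetched_tid) {
    *fetched_tid = s->tid;
    uint16_t fetch_addr = s->pc[s->tid];
    s->pc[s->tid] = (uint16_t)(fetch_addr + 1); /* PC incrementer */
    s->tid ^= 1;                                /* round-robin scheduler */
    return fetch_addr;
}
```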
ID/EX stage:
We merged the ID and EX stages into a single stage for performance improvement; the ISA was designed so that the ID stage can be merged with the EX stage. Depending on the type of the instruction and the thread ID, the corresponding register file is accessed and its contents are given to the ALU, which performs the arithmetic or logical operation on them.
MEM stage:
The FIFO memory access is performed for load and store instructions. The data address for loads and stores is calculated by the ALU; address calculation is governed by the control signals produced by decoding the instruction.
WB stage:
For register-to-register instructions, the instruction result is written back to the register file in the WB stage.
We used a 32-bit instruction format and a 64-bit datapath for our design.
The ALU has a CSA adder (performing the addition, subtraction, and SLT operations) along with AND, OR, shift, and XNOR units.
The instruction memory has 512 locations and the data memory has 256 locations.
Early branch design
Initially we used an RCA, but we observed significant combinational delay during synthesis, so we moved to a CSA, which enhanced the design.
R Type:
| Rt (31-28) | Rs (27-24) | Rd (23-20) | Shift MSB (19-16) | S2,S1,S0 (15-13) | Shift LSB (12-11) | Bits for future use (11-5) | Reserved bits for power (2-4) | Opcode (1-0) |

LW/SW Type:
| Rt (31-28) | Rs (27-24) | --- | Offset (23-8) | ADDI (7) | SUBI (6) | --- (5) | Reserved bits for power (2-4) | Opcode (1-0) |

J Type:
| Rt (31-28) | Rs (27-24) | Rd (23-20) | Jump (19) | JandL (18) | Link (17) | JR (16) | Beq (15) | BNE (14) | Jump Address (13-5) | Reserved bits for power (2-4) | Opcode (1-0) |
| Instruction | Source 1 | Source 2 | Destination |
| ADD | Rs | Rt | Rd |
| SUB | Rs | Rt | Rd |
| OR | Rs | Rt | Rd |
| AND | Rs | Rt | Rd |
| SHIFT LEFT | Rs | - | Rd |
| SHIFT RIGHT | Rs | - | Rd |
| XNOR | Rs | Rt | Rd |
| Lw | K(Rs) | - | Rt |
| Sw | K(Rs) | Rt | - |
| ADDI | Rs | Immediate value | Rt |
| SUBI | Rs | Immediate value | Rt |
| J (direct), J&L, Link | - | - | - |
| JR (Jump register) | Rs | - | - |
| BEQ | Rs | Rt | - |
| BNE | Rs | Rt | - |
A state machine is a mathematical model of computation used to design both computer programs and sequential logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states. The machine is in only one state at a time; the state it is in at any given time is called the current state. It changes from one state to another when initiated by a triggering event or condition; this is called a transition. A particular FSM is defined by a list of its states and the triggering condition for each transition.
The above state machine is instantiated for both cores, and the four control signals CPU_BUSY_CORE1, FIFO_BUSY_CORE1, CPU_BUSY_CORE2, and FIFO_BUSY_CORE2 are used by another, higher-level state machine, which performs a round-robin style of arbitration to dispatch packets to the idle processor. Having two cores controlled by an arbiter helps maintain the throughput. As the time required to process a packet in the CPU
increases, we may have to increase the number of cores in the design to maintain line speed. But this is not easy, as we are constrained by the number of slices that can be used on the NetFPGA. Keeping this in mind, we optimized our processor cores to accommodate two cores, two hardware accelerators (one for each processor), and the other interface logic within the available slices without compromising the timing requirements.
The other interesting aspect of our project is the design of the memory-mapped I/O module, which translates all the address locations of software and hardware registers into address locations easily accessible to the CPU. Only SW and LW instructions are used to set up data for the hardware accelerators or to store computed results from a hardware accelerator back to memory. One advantage of using this module is that we need not create new instructions to perform operations related to a hardware accelerator.
We have designed a special FIFO that allows data to flow between the NetFPGA reference pipeline and our processor. A control interface lets the FIFO act either as simple data memory, which the processor can use to modify the data, or as a standard FIFO that accepts an incoming packet and buffers it for later processing by the CPU. The FIFO is built from Block RAM modules instantiated as dual-port SRAMs. The SRAM module has two ports, A and B: port A (posedge-triggered) writes data into the memory, while port B (negedge-triggered) reads data (the packet) out of the FIFO. One address/data port pair serves as the input interface to the FIFO while the other serves as the output, and during CPU processing we have access to the head and tail addresses of the packet. The FIFO, along with its supporting state machine, performs the following operations: (1) it buffers all of the data and control information for one network packet; (2) once a packet is buffered, it sends a control signal to the CPU indicating that the packet has been received and is ready for processing (during this time the next packet is sent to the other core, which helps keep the throughput close to the link rate); (3) the processor has read and write access to the head and tail address registers as well as full access to the SRAM that holds the data. A special signal called "user mode" selects between data from the CPU's MEM stage in the pipeline and the SW/HW registers.
1) On reset, FIFO_BUSY is activated; we wait for a packet to arrive while simultaneously sending out the previously processed packet. Once we have finished sending the previous packet and received the current packet, control shifts to the CPU_BUSY state.
2) In the CPU_BUSY state, we run the two threads in the CPU. Once the threads have finished their work, the CPU_DONE signal is activated, which returns the machine to the FIFO_BUSY state to receive another packet.
As can be seen from the figure above, we use only two states to run the FIFO. By using a single state, FIFO_BUSY, for both receiving and sending a packet, we were able to improve the throughput significantly. Compared with the conventional four-state machine, the theoretical improvement ranges from 0% to 50%. For example, when the duration of an incoming packet and the time to process a packet are the same, the improvement is about 30%.
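That estimate can be checked with a back-of-the-envelope model (a sketch, assuming receive, process, and send each take the same time T; with equal durations it yields about 33%, in line with the roughly 30% quoted above):

```c
/* Throughput gain of the two-state FIFO controller over a
   conventional four-state one. In the four-state machine, receive,
   process, and send happen sequentially; in the two-state machine,
   sending the previous packet overlaps with receiving the next one
   inside FIFO_BUSY. Returns the fractional reduction in cycle time. */
double fsm_improvement(double t_recv, double t_proc, double t_send) {
    double four_state = t_recv + t_proc + t_send;
    double t_io = (t_recv > t_send) ? t_recv : t_send; /* overlapped I/O */
    double two_state = t_io + t_proc;
    return (four_state - two_state) / four_state;
}
```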
A network intrusion detection system (NIDS) monitors traffic on a network, looking for suspicious activity such as an attack or unauthorized access, by detecting malicious patterns on the network in order to maintain data and file integrity.
Objective: To achieve an extendible 20-pattern match in a Network Intrusion Detection System.
The implementation of this system contains the following modules:
• Input Queue
• Input arbitration
• Routing decision and packet modification
• Output queuing
• Output
Overview
In the input queue stage, as data comes in, the input arbiter allocates the bus based on the priorities of the input queues.
Once a packet gets access to the bus, the output port lookup module sets the destination port metadata field for the packet and then forwards it to the output queues.
The NIDS module is inserted between the output port lookup module and the output queues; since the design is pipelined, inserting the NIDS module does not compromise the throughput.
The NIDS receives packets from the output port lookup module every clock cycle and checks them for the patterns. If there is a match, the packet is dropped and does not reach the output queues.
A proper handshaking mechanism is implemented between the modules so that packets are not dropped unnecessarily because of improper synchronization.
The algorithm uses exhaustive search, in which every 56-bit alignment of the incoming data is hashed and compared with the key of the pattern. The design uses two hash functions to detect a pattern, and a match is generated only if both hash functions generate a match signal.
Equality of the pattern with substrings of the text is tested using a hash function. A hash function converts every string into a numeric value, called its hash value; if two strings are equal, their hash values are also equal. Thus it would seem that all we have to do is compute the hash value of the substring we are searching for and then look for substrings with the same hash value.
For the first key, 16 groups of 4 random bits are hashed (using combinational logic) into a 16-bit key. This operation is performed on both the input data and the pattern to be matched. The keys obtained from the data and the patterns are compared to produce the Match 1 signal.
For the second key, 16 random bits (as shown below) are selected from the 56 bits to generate the hash key. This operation is again performed on both the input data and the pattern to be matched.
The keys obtained from the data and the pattern are compared to produce the Match 2 signal.
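A software sketch of the dual-hash matcher described above (the two hash functions below are illustrative stand-ins, since the exact bit selections are not specified here; only the structure follows the design: hash every 56-bit alignment and require both keys to agree):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key #1: fold the 56-bit window's fourteen 4-bit
   nibbles into a 16-bit key (stand-in for the combinational hash). */
static uint16_t hash1(uint64_t w) {
    uint16_t key = 0;
    for (int i = 0; i < 14; i++)  /* 14 nibbles = 56 bits */
        key = (uint16_t)(((key << 4) | (key >> 12)) ^ ((w >> (4 * i)) & 0xF));
    return key;
}

/* Hypothetical key #2: pick 16 fixed "random" bit positions. */
static const int taps[16] = {1,4,7,10,14,19,23,28,31,36,40,44,47,50,53,55};
static uint16_t hash2(uint64_t w) {
    uint16_t key = 0;
    for (int i = 0; i < 16; i++)
        key |= (uint16_t)(((w >> taps[i]) & 1) << i);
    return key;
}

/* Exhaustive search: hash every 56-bit (7-byte) alignment of the
   input and flag a match only when BOTH keys agree with the
   pattern's keys. */
int nids_match(const uint8_t *data, size_t len, uint64_t pattern) {
    uint16_t k1 = hash1(pattern), k2 = hash2(pattern);
    for (size_t i = 0; i + 7 <= len; i++) {
        uint64_t w = 0;
        for (int b = 0; b < 7; b++)
            w = (w << 8) | data[i + b];   /* next 56-bit window */
        if (hash1(w) == k1 && hash2(w) == k2)
            return 1;   /* match signal: drop the packet */
    }
    return 0;
}
```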
Look-Up Hardware Accelerator
The main requirement for this hardware accelerator is to differentiate between normal packets and instruction packets. As a packet traverses the design, it travels through parse logic, where it is parsed and passed to the matcher circuit. In the matcher circuit, the packet is compared against patterns predefined using 20 software registers. The behavior then depends on the following scenarios:
Scenario 1: If there is no match, the packet is processed in the regular fashion and forwarded through the design.
Scenario 2: If there is a match, the packet is forwarded to further hardware: CAMs maintaining allowed and denied lists of IP addresses (configurable). If the packet is found in the allowed list, the corresponding action in the action table is taken, which is to allow the packet for processing; if the packet is found in the denied list, the corresponding action from the action table, dropping the packet, is initiated.
Scenario 3: If there is a match but the respective IP addresses are found in neither the allowed list nor the denied list, the packet is rerouted to the controller, where it is processed and sent back to the network as an instruction packet.
IP HEADER CHECKSUM
This hardware accelerator reads the header out of the data memory, calculates its checksum, and stores it back into the header checksum field of the data. This hardware accelerator can be used in many other network applications.
The header checksum is calculated over 4 words in the header, which include the option field, the sender IP field, and the receiver IP field. An example of the header checksum calculation is shown below.
Take the following truncated excerpt of an IP packet.
The header is shown in bold and the checksum is underlined.
4500 0073 0000 4000 4011 b861 c0a8 0001 c0a8 00c7 0035 e97c 005f 279f 1e4b 8180
To calculate the checksum, we first calculate the sum of each 16-bit value within the header, skipping only the checksum field itself. Note that the values are in hexadecimal notation: 4500 + 0073 + 0000 + 4000 + 4011 + c0a8 + 0001 + c0a8 + 00c7 = 2479C (equivalent to 149,404 in decimal).
Next, we convert the value 2479C to binary: 0010 0100 0111 1001 1100. The first nibble (4 bits) is the carry, which is added back to the rest of the value: 0010 + 0100 0111 1001 1100 = 0100 0111 1001 1110 (0x479E). Finally, we take the one's complement: NOT 0x479E = 0xB861, which is the checksum value underlined in the header above.
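The same calculation can be written as a standard IPv4 checksum routine (a sketch over the full 10-word header, matching the worked example above; the accelerator itself operates on the 4 words it is given):

```c
#include <stddef.h>
#include <stdint.h>

/* IPv4 header checksum over 16-bit words: sum the words with
   end-around carry, then take the one's complement. The checksum
   field itself (word index 5) is treated as zero during the sum. */
uint16_t ip_header_checksum(const uint16_t *words, size_t nwords) {
    uint32_t sum = 0;
    for (size_t i = 0; i < nwords; i++) {
        if (i == 5) continue;      /* skip the checksum field */
        sum += words[i];
    }
    while (sum >> 16)              /* fold carries back into the sum */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;         /* one's complement */
}
```

Running this over the example header reproduces the underlined checksum, 0xB861.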
A Perl script, 'hwreg', is used to initialize designs on the NetFPGA and enable packet processing in the CPU.
A top-level shell script such as the following can be used to automate programming of the bitfile (the compiled NetFPGA design) and enable the CPU:

# initialize NetFPGA: download the bitfile and start the control daemon
sudo killall rkd
nf_download /path/to/netfpga_design.bit
/usr/local/netfpga/projects/router_kit/sw/rkd &
sleep 1s

# set control node's IP address, in this case 10.1.3.3
hwreg setcontrol 0x0a010303

# Program Core 1, Thread 1
hwreg program core1 t1 /path/to/bitfile.core1_thread1.s
# Program Core 1, Thread 2
hwreg program core1 t2 /path/to/bitfile.core1_thread2.s
sleep 1s

# Program Core 2, Thread 1
hwreg program core2 t1 /path/to/bitfile.core2_thread1.s
# Program Core 2, Thread 2
hwreg program core2 t2 /path/to/bitfile.core2_thread2.s
sleep 1s

# set user mode to enable CPU processing
hwreg usermode
Summary of 'hwreg' options:

| Command | Description |
| program core<1/2> t<1/2> <bitfile> | Program bitfile for core# & thread#. |
| verify core<1/2> t<1/2> <bitfile> | Verify programmed bitfile for core# & thread#. |
| setcontrol <control_ip> | Set control node's IP. Eg: 0x0a010203 for 10.1.2.3. |
| usermode | Enable packet processing in CPU. |
| help | Print help message. |

Memory Mapped Data Locations

| Memory addresses | Read / Write | Description |
| 0x000 - 0x0FF | Read & write | Single packet data. Each packet is stored in a FIFO in the given range to be processed by the CPU. Dedicated for each core. |
| 0x100 - 0x17F | Read & write | Scratch space for CPU thread 1. Scratch memory for CPU usage for thread 1. Dedicated for each core. |
| 0x180 - 0x1FF | Read & write | Scratch space for CPU thread 2. Scratch memory for CPU usage for thread 2. Dedicated for each core. |
| 0x481 | Read only | Packet head address. The address of the first data word of the current packet in the FIFO. Dedicated for each core. |
| 0x482 | Read only | Packet tail address. The address of the last data word of the current packet in the FIFO. Due to the nature of the circular FIFO, the tail address can be less than the head address; addresses beyond 0x0FF should not be considered part of the packet. Dedicated for each core. |
| 0x490 - 0x491 | Read only | Checksum hardware accelerator results. Two result words (64-bit each): the newly computed packet checksum based on changes made to the MAC/IP addresses in the packet. Dedicated for each core. |
| 0x490 - 0x493 | Write only | Checksum hardware accelerator data setup. Four 64-bit words taken from the packet, carrying the MAC, IP, PORT, and CHECKSUM data used to calculate the new checksum. Dedicated for each core. |
| 0x49F | Read only | Checksum hardware accelerator done signal. Read to check whether the checksum hardware accelerator is done calculating the new checksum. Dedicated for each core. |
| 0x49F | Write only | Checksum hardware accelerator start signal. Written to initiate the checksum calculation after the data (0x490-0x493) has been set up. Dedicated for each core. |
| 0x4A0 | Read only | Action table hardware accelerator's action result. Read the action needed for the current packet; the action is pre-computed before the CPU cycle. Dedicated for each core. |
| 0x4B0 | Read only | Easter egg. Returns hard-coded 0xCAFEBABE. Dedicated for each core. |

When tracing from the head address to the tail address, the following snippet should be used:

    while (i != tailaddr) {
        // insert user code here...
        i = i + 1;
        // 'i' cannot go beyond 255 (FIFO locations 0x000 - 0x0FF)
        if (i == 256) {
            i = 0;
        }
    }
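The head/tail wraparound rule can also be packaged as a small host-side helper for testing packet-traversal logic (packet_addrs is an illustrative name; it records the FIFO addresses a packet occupies, treating the tail word as part of the packet):

```c
/* Collect the FIFO addresses occupied by a packet, walking from the
   head address to the tail address and wrapping past 0x0FF back to
   0x000, as described for the packet tail address register above.
   Returns the number of words in the packet. */
int packet_addrs(int head, int tail, int *out) {
    int n = 0;
    int i = head;
    for (;;) {
        out[n++] = i;            /* record this packet word's address */
        if (i == tail)
            break;               /* tail word is included in the packet */
        i = (i + 1) & 0xFF;      /* wrap within FIFO locations 0x000-0x0FF */
    }
    return n;
}
```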
Performance Evaluation
OpenFlow switch implementation on the NetFPGA:
Testing packet: PING (size: 64 bytes), mode: UP. Flows achieved = 61K/s, so each update takes 16.4 us.
Our design: on average, the time required to process one ping packet is 0.1 us, so the number of ping packets received in 1 s is 10^7. Each packet can make one flow update, therefore the number of flow updates achieved is 10^7/s.
Performance improvement in terms of number of updates = 10^7 / 61K = 163.93X (firmware).
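The comparison above is just a ratio of update rates, which can be restated as a one-line helper (illustrative; the two rates are the measured figures quoted above):

```c
/* Flow-update speedup: our design's update rate divided by the
   reference OpenFlow switch's update rate (61K updates/s). */
double update_speedup(double ours_per_sec, double reference_per_sec) {
    return ours_per_sec / reference_per_sec;
}
```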
CAM Look-Up Hardware Accelerator

ALU

Core Design

Register File