High-Performance Hardware Monitors to Protect Network Processors from Data Plane Attacks Kekai Hu, Harikrishnan Chandrikakutty, Deepak Unnikrishnan, Tilman Wolf, and Russell Tessier Department of Electrical and Computer Engineering University of Massachusetts Amherst Department of Electrical and Computer Engineering Outline The problem: software attacks on network routers • Routers now include programmable processors Our solution: include a monitor for the processor Software to generate the monitor information Implementation on DE4 NetFPGA board Experimental results Russell Tessier 2 Computer Networks Networks provide connectivity between endsystems Success of the Internet: hourglass architecture Success is also a problem: diverse apps, diverse systems Changing requirements for the network layer Layered protocol stack Example protocols HTTP DNS Application layer TLS/SSL Transport layer BGP ... UDP Network layer ... SIP TCP IP Ethernet ... Link layer DSL FDDI 1000BASE-T Physical layer ... RS-232 802.11a/b/g/n SONET/SDH Russell Tessier 3 Network Architecture Extensions to current Internet Customization of data plane Requires router systems with ability to adapt Access router: - Access concentration (cable, DSL, wireless) - Network address translation - Policy-based QoS - Monitoring and billing - Firewall ` End system: - IP security - TCP termination Russell Tessier Core router: - Multiprotocol label switching - QoS aware routing - Monitoring Edge router: - Packet classification - QoS (DiffServ) - monitoring and billing Server: - Content-based switching - Firewall - SSL termination - IP security 4 Programmable Router Network processor • General-purpose processing capability in data path • Packet processing in software Router Port packets High-performance processing hardware Example: 40-core network processors Key challenge: • Security due to software processing Russell Tessier Switching Fabric Port Network Processor Network Interface • Scalability through high levels of parallelism Port Processing Processing Engine Engine Port Port Interconnect Processing Processing Engine Engine I/O Protocol processing on packet processor 5 Exploit of Vulnerable Network Processor Vulnerability in network processor can be exploited “In-network” denial of service attack (new type of attack) Key questions • Can such vulnerabilities occur in packet processing code? (Yes, we show one example.) • Can vulnerabilities be exploited? (Yes, for von-Neumann and for Harvard architectures) Russell Tessier Packet forwarding software Malicious packet Vulnerable packet processor In-network denial of service attack Unprotected software-based router 6 Header Insertion Exploit Stack smashing during header insertion • Control flow can be changed to attack code Low address . . . New pkt buffer Current frame ptr Attack code: infinite transmission loop Local var Previous frame ptr Return address Current frame (generate CM header) • Devastating denial-of-service attack Prototype implementation • New stack pointer Custom network processor on NetFPGA Harvard arch.: possible, but less exciting Russell Tessier . . . Original packet hdr Stack growth Packet payload (attack code) High address 7 Attack model Original packet • Congestion management protocol Link Hdr IP Hdr UDP Hdr payload CM application – Inserts a custom header New packet Link Hdr IP Hdr CM Hdr len1 UDP Hdr payload len2 • Integer overflow vulnerability – len1 = 12 – len2 = 65334 – sum = 10 8 Defense Mechanism: Hardware Monitor Change of software easy • Just need matching monitoring graph Russell Tessier DFA monitoring graph processing code instruction memory network processor core mon. graph mon. memory hash of processing instruction reset/ recovery comparison logic hardware monitor • Obtained from offline analysis of binary • Deviations trigger reset NFA monitoring graph NFA-to-DFA transformation network processor Monitoring graph represents correct behavior processing code binary offline analysis • Core reports hash of each executed instruction runtime operation Hardware monitor co-located with each processor core data memory packet buffer network interface 9 Offline Analysis of Processing Binary Executed instruction reported by core as 4-bit hash • Hash combines address, opcode, registers • Hash allows for compact representation of information Monitoring graph • Each instruction represented as a state • Edges correspond to execution of instruction • Control-flow operations lead to multiple possible next states […] 49c: 4a0: 4a4: 4a8: 4ac: 4b0: 4b4: 4b8: […] Russell Tessier 49c 97c20010 00000000 2c420033 1440000a 00000000 3c026666 34430191 97c20010 lhu nop sltiu bnez nop lui ori lhu v0,16(s8) v0,v0,51 v0,4d4 v0,0x6666 v1,v0,0x191 v0,16(s8) 0 4a0 7 4a4 11 4a8 10 6 4ac 10 4b0 7 4b4 4d4 3 4b8 10 NFA-to-DFA conversion Problem: non-deterministic finite automaton (NFA) • State may have two next states with same edge value due to hash 1 2 b c 3 4 d 5 e 6 f a c • Implementation would need to keep track of multiple states Solution: NFA-to-DFA conversion (powerset construction) • Well-known algorithm [Hopcroft and Ullman, 1976] f 1 2 b {3,5} c 4 d 5 e 6 f a • Deterministic finite automaton (DFA) requires only one state Russell Tessier 11 Implementation of DFA Monitor Each state may have up to 16 next states • Most states only have one or two next states Requirements • Compact representation • Fast processing of each hash value (single memory access) Idea: grouping of next states in contiguous memory • Trick: determine offset into memory based on order of hash value 9 9 a 7 f a 7 f 2 2 grouping b d 7 11 c 3 0 14 e 2 g b group 1 group 2 h 7 11 c 3 d 0 g e 5 2 h group 3 Russell Tessier 12 Implementation of DFA Monitor DFA monitor system • Memory keeps all valid hash values in one bit vector • Next state based on offset of group and position of hash value reset/ recovery 1 hash comparison 16 ... 16 4 ... mult group 1 0x0000 group 2 0x0002 group 3 0x0006 32 ... 16 number of offset in next states state group valid hash values on outgoing edges 2 0 0000 0000 1000 0100 ... c 2 1 0000 1000 0000 1000 group base address b 1 1 0000 0000 1000 0000 f 1 0 0000 0010 0000 0000 e 3 0 0000 0000 0010 0101 d ... ... ... f 1 0 0000 0010 0000 0000 h ... ... ... g ... ... ... group 16 4-bit hash function 32 -1 a 4 processor instruction 4 ... one-hot encoding 4 add k (position of matching hash among valid hash values) group 1 group 2 group 3 state machine memory Russell Tessier 13 Altera DE4 NetFPGA Infrastructure Stratix IV SOPC SYSTEM 1 GigE 1 GigE PCIe SSRAM TSE MAC DDR RAM TSE MAC 1 GigE TSE MAC 1 GigE TSE MAC HOST COMPUTER • Altera DE4 board • 5x resources compared to NetFPGA 1G • PCI Express, 8GB DDR2 DE4 BOARD CUSTOM INTERFACE Open source port of NetFPGA 1G [1] Networking research platform REFERENCE ROUTER OR PACKET GENERATOR Configurable designs • Reference router • Packet generator Russell Tessier JTAG MASTER BRIDGE JTAG 14 Altera DE4 NetFPGA reference router Complete IPV4 router • Forward packets on all four 1Gbps interfaces MAC RxQ CPU RxQ MAC RxQ CPU RxQ MAC RxQ CPU RxQ MAC RxQ CPU RxQ CPU TxQ MAC TxQ CPU TxQ Design components • • • • Input Arbiter Input queues Input arbiter Output port lookup Output queues Prototype system replaces output port lookup module Russell Tessier Output port lookup Output Queues MAC TxQ CPU TxQ MAC TxQ CPU TxQ MAC TxQ 15 Single-core system architecture Single-core network processor system integrated along with Altera DE4 NetFPGA reference router pipeline Russell Tessier 16 Harvard memory architecture • Separate physical memory space for instruction and data – General memory error techniques for code-injection attacks will not be successful – Attacks need to be generated with existing library functions Instuction Memory Instructions Data Memory Address Instruction Address CPU Data Data Stack 17 Experimental Setup Altera DE4 reference router with prototype system Altera DE4 packet generator • Generate packets • Capture forwarded packets Evaluation metrics • Attack detection • Throughput • Resource utilization Russell Tessier 18 Offline analysis MIPS-GCC compiler generates binary and branch information Powerset construction to perform NFAto-DFA transformation. Memory initialization file is loaded into memory Benchmark Source code MIPS-GCC compiler Instruction Info Branch Info NFA to DFA conversion module Instruction Info Branch Info Memory generator module Memory initialization file Russell Tessier 19 Evaluation Monitoring speed • Single memory access • Lookup into fixedsize register file Memory size of monitor • More states due to NFA-to-DFA conversion • More states due to multiple entries in memory for certain states • In practice, overhead is below 10% Very fast and compact hardware monitor Russell Tessier 20 Benchmarks • NpBench [1] – Modern network applications – Three specific functional groups • Traffic management and quality of service group • Security and media processing group • Packet processing group [1] B.K. Lee and L.K. John, "NpBench: a benchmark suite for control plane and data plane applications for network processors," in Proc of 21st International Conference on Computer Design, vol., no., pp. 226- 233, 13-15 Oct. 2003. 21 Prototype Implementation on FPGA Small overhead compared to processor core (4096 states): Correct operation: attack packet detected and dropped Russell Tessier 22 Attack with Defense in Place Attack packet dropped, router continues to operate Russell Tessier 23 Throughput Throughput performance of the network processor with security monitor CM protocol and IPV4 application Throughput in Mbps • 50 40 30 Normal Packet Packet ratio 100:1 Packet ratio 25:1 Packet ratio 10:1 20 10 0 64 256 Packet size in bytes Russell Tessier 24 24 Multicore Monitor Dynamic workloads pose problem for hardware monitor • Processing may differ between packets • Monitors need to match processing core core core monitor monitor monitor Mapping between processors and monitors core • 1-to-1 mapping requires frequent reload of monitor • Any-to-any mapping costly to implement monitor • Clusters with n-to-m mapping provide balance Interconnect is configured dynamically depending on workload • Mapping between core and monitor Russell Tessier monitor core ... ... core monitor core ... ... monitor core monitor core core ... ... monitor core ... monitor ... monitor monitor monitor ... monitor 25 System Architecture of Clustered System Multiple cores can access multiple monitors • Dynamic configuration of crossbar Secure loading of monitors through external interface External Memory Network Interface Inter-core Interconnect Control Processor n Processors Control Signals Proc Proc ... Proc Proc Proc ... Proc ... Proc Crossbar Crossbar Proc ... Proc Crossbar m Monitors Mon External Interface Russell Tessier Centralized Monitor Memory ... Mon Mon Mon ... Mon ... Mon Mon ... Mon ... AES Mon 26 Cluster Design Simple implementation of clustered monitor • Dynamic configuration through programming of demultiplexers Reset/Recover NP Core Monitor Select 1 NP Core NP Core NP Core Hash Hash Hash 32 3 Hash R/R from Monitor_1 to 6 4 Hash_2 Hash_1 From Hash_1 to 4 Hash_3 Hash_4 Reset/ Recover 4 Proc Select 1 2 Hash 4 Monitor_1 Russell Tessier Monitor_2 Monitor_3 Monitor_4 Monitor_5 Monitor_6 27 Dual-Ported Monitor Implementation Memory of monitor can be shared between two monitors • Effective use of dual-ported memory • Two monitoring graphs can be used in parallel Reset/Recover 1 Monitor 2 Valid Hashes Processor Instruction Next State 14 Hash Comparison 1 Monitoring Graph 1 4 16 Next State Select Graph Select One-hot Encoding Next State Select 32 1 Processor Instruction Monitor 1 Russell Tessier 16 4 K 4-bit Hash Function 4 K Monitoring Graph 2 4 4-bit Hash Function K-1 One-hot Encoding Graph Select 32 Hash Comparison 1 Next State Valid Hashes 14 Reset/Recover 28 Prototype Implementation on FPGA Resources cost for a multi-core system (4 cores, 6 monitors): Russell Tessier 29 Conclusions Current and future Internet needs to meet new demands Programmable routers provide packet processing platform • Systems problem: security vulnerabilities • Attacks can be launched within data plane (i.e., not control access) • Monitor-based hardware defense mechanism is effective Hardware monitor design and prototype • Uses compact DFA (less than 10% more states than NFA) • Verification with single memory access per instruction • Defense shown for Harvard architecture attack Technique extended to multicore network processors Promising defense against attacks on network infrastructure Russell Tessier 30